Data Preparation for Hypoxia Experiment of H358 Cell Lines¶
- License: Creative Commons Attribution-NonCommercial 4.0 International License
- Version: 0.1
- Edit Log:
- 2025-11-28: Initial version of the notebook
Requirements:
- Raw DIA-NN output: report.parquet, report.pg_matrix.tsv
- Reference databases: UniProt FASTA, UniProt annotations, iPTMNet, MEROPS
- Prior notebook: None (this is the first notebook in the NSCLC analysis pipeline)
Data Information:
- Cell Line: H358 (Non-Small Cell Lung Cancer)
- Conditions: True Hypoxia (1% O₂), Normoxia (21% O₂)
- Timepoints: 48hr, 72hr
- Replicates: 4 biological replicates per condition-timepoint
- Total Samples: 16 (before QC filtering)
Purpose:
Prepare peptide-level quantitative proteomics data for downstream proteoform analysis:
- Load and integrate reference databases (FASTA, UniProt, iPTMNet, MEROPS)
- Parse experimental metadata and quantitative data from DIA-NN output
- Map peptides to protein sequences with positional information
- Apply quality control filters (overlap removal, outlier detection)
- Normalize and impute missing values
- Calculate protein coverage metrics and save cleaned data
01. Setup¶
This section imports required libraries, configures display settings, and defines paths for data and figures.
Note: The HTML rendering of this notebook hides code cells by default. Click the "Code" buttons on the right to expand them.
01.1 Libraries¶
# Libraries Used
import re
import os
import sys
import numpy as np # Numerical computing
import pandas as pd # Data manipulation
import seaborn as sns # R-like high-level plots
import matplotlib.pyplot as plt # Python's base plotting
# home path
sys.path.append('../')
from src import utils, plots
from ProteoForge import annotate, parse
from ProteoForge import process, coverage
from ProteoForge import normalize, impute
# Initialize the timer
startTime = utils.getTime()
01.2 Configure Notebook Settings¶
Configure visualization styles, color palettes, and display options for consistent output formatting throughout the analysis.
## Configure Plotting Settings
def_colors = [
"#139593", "#fca311", "#e54f2a",
"#c3c3c3", "#555555",
"#690000", "#5f4a00", "#004549"
]
condition_colors = {
'True Hypoxia (1% O2) - 48 Hr': '#e9c46a',
'True Hypoxia (1% O2) - 72 Hr': '#f4a261',
'Normoxia (21% O2) - 48 Hr': '#ade8f4',
'Normoxia (21% O2) - 72 Hr': '#0077b6',
# Won't be used since the manuscript didn't cover these conditions
# 'Chemical Hypoxia (CoCl2) - 48 Hr': '#c77dff',
# 'Chemical Hypoxia (CoCl2) - 72 Hr': '#8338ec',
# 'Oxidative Stress (H2O2) - 48 Hr': '#ff006e',
# 'Oxidative Stress (H2O2) - 72 Hr': '#fb5607',
}
cv_group_palettes = {
"<10%": "#081c15",
"<25%": "#1b4332",
"<50%": "#2d6a4f",
"<100%": "#52b788",
">100%": "#95d5b2",
}
# Set seaborn style
sns.set_theme(
style="white",
context="paper",
palette=def_colors,
font_scale=1,
rc={
"figure.figsize": (6, 4),
"font.family": "sans-serif",
"font.sans-serif": ["Arial", "Ubuntu Mono"],
}
)
# Figure Saving Settings
figure_formats = ["png", "pdf"]
save_to_folder = True
transparent_bg = True
figure_dpi = 300
## Configure dataframe displaying
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500) # Set a wider display width
## Printing Settings
verbose = True
plots.color_palette( def_colors, save=False )
plots.color_palette( cv_group_palettes, name="CVGroups", save=False )
plots.color_palette( condition_colors, save=False, name='condition', size=2.5)
01.3 Data and Result Paths¶
Defines input/output directory structure:
- Input: ./data/input/hypoxia/ — DIA-NN output files
- References: ./data/input/ — Reference tables such as FASTA and annotations
- Output: ./data/cleaned/hypoxia/ — Processed data tables
- Figures: ./figures/hypoxia/01-DataPreparation/ — QC visualizations
home_path = './'
data_name = "hypoxia"
notebook_name = "01-DataPreparation"
data_path = f"{home_path}data/"
input_path = f"{data_path}input/{data_name}/"
output_path = f"{data_path}cleaned/{data_name}/"
figure_path = f"{home_path}figures/{data_name}/{notebook_name}/"
# Create the output folder if it does not exist
if not os.path.exists(output_path):
    os.makedirs(output_path)
# Create figure folder structure, if needed
if save_to_folder:
    for i in figure_formats:
        cur_folder = figure_path + i + "/"
        if not os.path.exists(cur_folder):
            os.makedirs(cur_folder)
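The per-format folders created above suggest a small save helper that writes one copy of each figure per format. The sketch below is hypothetical (save_figure is not part of the utils or plots modules); it only assumes the notebook's figure_formats, figure_path, dpi, and transparency settings:

```python
import os
import matplotlib
matplotlib.use("Agg")  # headless backend, safe for scripted saving
import matplotlib.pyplot as plt

def save_figure(fig, name, figure_path, figure_formats=("png", "pdf"),
                dpi=300, transparent=True):
    """Save one copy of the figure per format into <figure_path>/<fmt>/<name>.<fmt>."""
    for fmt in figure_formats:
        folder = os.path.join(figure_path, fmt)
        os.makedirs(folder, exist_ok=True)
        fig.savefig(os.path.join(folder, f"{name}.{fmt}"),
                    dpi=dpi, transparent=transparent, bbox_inches="tight")

# Example: save a throwaway figure in both formats
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([0, 1], [0, 1])
save_figure(fig, "demo", "./figures/hypoxia/01-DataPreparation/")
plt.close(fig)
```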
02. Load Datasets¶
This section loads all required reference databases and experimental data. Multiple data sources are integrated to enable comprehensive proteoform analysis.
02.1 FASTA Reference Sequences¶
The UniProt human reference proteome (UP000005640) provides canonical protein sequences for peptide mapping. Specifically, the 20230316_HumanCr_20806.fasta file from the Hypoxia project folder is loaded and parsed.
fasta_data = parse.fasta_to_df(
fasta_path = f'./data/input/20230316_HumanCr_20806.fasta',
gene_only = False,
column_order=['entry', 'geneName', 'proteinDescription', 'sequenceLength', 'molecularWeight_kDa', 'sequence']
).rename(
columns={
'entry': 'Protein',
'geneName': 'Gene',
'proteinDescription': 'Description',
'sequenceLength': 'Length',
'molecularWeight_kDa': 'Weight(kDa)',
'sequence': 'Sequence'
}
)
fasta_data.head()
Processing FASTA file: ./data/input/20230316_HumanCr_20806.fasta
Auto-selected: multiprocessing with 4 processes
Loading FASTA entries...
Processing 26855 entries in chunks of 2828
Processed 26772 entries, skipped 83
 - invalid_sequence: 78
 - length_filter: 5
Generated DataFrame with 26772 rows and 6 columns
| Protein | Gene | Description | Length | Weight(kDa) | Sequence | |
|---|---|---|---|---|---|---|
| 0 | A0A024B7W1 | None | Genome polyprotein | 3423 | 378.8692 | MKNPKKKSGGFRIVNMLKRGVARVSPFGGLKRLPAGLLLGHGPIRM... |
| 1 | A0A024R1R8 | TMA7B | Translation machinery-associated protein 7B | 64 | 7.0870 | MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAK... |
| 2 | A0A024RBG1 | NUDT4B | Diphosphoinositol polyphosphate phosphohydrola... | 181 | 20.4213 | MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQW... |
| 3 | A0A075B6H7 | IGKV3-7 | Probable non-functional immunoglobulin kappa v... | 116 | 12.7754 | MEAPAQLLFLLLLWLPDTTREIVMTQSPPTLSLSPGERVTLSCRAS... |
| 4 | A0A075B6H8 | IGKV1D-42 | Probable non-functional immunoglobulin kappa v... | 117 | 13.0056 | MDMRVPAQLLGLLLLWLPGVRFDIQMTQSPSFLSASVGDRVSIICW... |
The parsed FASTA reference contains 26,772 protein sequences with UniProt accessions, gene names, descriptions, lengths, molecular weights, and full amino acid sequences. These will be used for peptide-to-protein position mapping and other downstream analyses.
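Peptide-to-protein position mapping later in the pipeline boils down to locating each stripped peptide inside its parent sequence. The actual mapping is handled by ProteoForge; the sketch below (find_peptide is a hypothetical stand-in, and the sequences are toy data) shows the core idea of converting a 0-based str.find hit into 1-based protein coordinates:

```python
import pandas as pd

# Toy reference: Protein -> full amino acid sequence
fasta = pd.DataFrame({
    "Protein": ["A0A024RBG1"],
    "Sequence": ["MMKFKPNQTRTYDREGFKKRAACLCEVYEEAGVKGK"],
})
lookup = dict(zip(fasta["Protein"], fasta["Sequence"]))

peptides = pd.DataFrame({
    "Protein": ["A0A024RBG1", "A0A024RBG1"],
    "Peptide": ["EVYEEAGVK", "NOTPRESENT"],
})

def find_peptide(row, seq_lookup):
    """Return 1-based (start, end) of a peptide in its parent sequence, or (NA, NA)."""
    seq = seq_lookup.get(row["Protein"], "")
    idx = seq.find(row["Peptide"])  # 0-based index, -1 if not found
    if idx == -1:
        return pd.NA, pd.NA
    return idx + 1, idx + len(row["Peptide"])

peptides[["Start", "End"]] = [
    find_peptide(row, lookup) for _, row in peptides.iterrows()
]
print(peptides)  # EVYEEAGVK maps to residues 26-34; the other peptide gets NA
```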
02.2 UniProt Annotations¶
Here we prepare the UniProt annotations file to help map additional protein information onto the proteoforms and PTMs observed in the data. The annotate module from ProteoForge is used to load and parse the uniprotkb_Human_AND_model_organism_9606_2025_07_22.txt file. Additional processing functions then make specific annotation types more accessible.
# Parse the UniProt features data from the large text file
print("\n--- 📄 Parsing UniProt Features Data ---")
uniprot_data = annotate.preprocess_uniprot(f"./data/input/uniprotkb_Human_AND_model_organism_9606_2025_07_22.txt")
# Expand the uniprot
uniprot_data.columns = ['Protein', 'feature', 'isoform', 'start', 'end', 'description']
# Set up additional columns for further information keeping
uniprot_data['agent'] = ''
uniprot_data['note'] = ''
uniprot_data['group'] = ''
uniprot_data = uniprot_data[['isoform', 'Protein', 'feature', 'start', 'end', 'group', 'agent', 'note', 'description']]
uniprot_data['start'] = uniprot_data['start'].fillna(0).astype(int)
uniprot_data['end'] = uniprot_data['end'].fillna(uniprot_data['start']).astype(int)
uniprot_data.sort_values(by=['Protein', 'start', 'end'], inplace=True)
# Cleaning up and expanding
print("\n--- 🧹 Cleaning UniProt Features Data ---")
print("Processing the individual features for better info extraction...")
print(" - SITE to Cleavage and Breakpoints...")
uniprot_data = annotate.process_site_features(uniprot_data)
print(" - LIPID features...")
uniprot_data = annotate.process_lipid_features(uniprot_data)
print(" - CROSSLNK features...")
uniprot_data = annotate.process_crosslink_features(uniprot_data)
print(" - CARBOHYD features...")
uniprot_data = annotate.process_carbohydrate_features(uniprot_data)
print(" - DISULFID features...")
uniprot_data = annotate.process_disulfide_features(uniprot_data)
print(" - VAR_SEQ features...")
uniprot_data = annotate.process_varseq_features(uniprot_data)
print(" - MUTAGEN features...")
uniprot_data = annotate.process_mutagen_features(uniprot_data)
print(" - CONFLICT features...")
uniprot_data = annotate.process_conflict_features(uniprot_data)
print(" - MOD_RES features...")
uniprot_data = annotate.process_modres_features(uniprot_data)
print(" - VARIANT features...")
uniprot_data = annotate.process_variant_features(uniprot_data)
# Clean up the empty group, agent, note, and description columns
uniprot_data['group'] = uniprot_data['group'].fillna('')
uniprot_data['agent'] = uniprot_data['agent'].fillna('')
uniprot_data['note'] = uniprot_data['note'].fillna('')
uniprot_data['description'] = uniprot_data['description'].fillna('')
# Display the first few rows of the parsed UniProt features data
print("\nParsed UniProt Features Data Sample:")
print(uniprot_data.head())
print("\nUniProt Features by Occurrence:")
print(uniprot_data['feature'].value_counts())
--- 📄 Parsing UniProt Features Data ---
--- 🧹 Cleaning UniProt Features Data ---
Processing the individual features for better info extraction...
- SITE to Cleavage and Breakpoints...
- LIPID features...
- CROSSLNK features...
- CARBOHYD features...
- DISULFID features...
- VAR_SEQ features...
- MUTAGEN features...
- CONFLICT features...
- MOD_RES features...
- VARIANT features...
Parsed UniProt Features Data Sample:
isoform Protein feature start end group agent note description
1158389 A0A023HJ61 NON_TER 16 16
813523 A0A023I7H2 SIGNAL 1 21
813524 A0A023I7H2 CHAIN 22 603 NADH-ubiquinone oxidoreductase chain 5
813539 A0A023I7H2 DOMAIN 68 118 NADH-Ubiquinone oxidoreductase (complex I) cha...
813525 A0A023I7H2 TRANSMEM 84 105 Helical
UniProt Features by Occurrence:
feature
DOMAIN 205497
TRANSMEM 174613
NON_TER 167406
COMPBIAS 163351
REGION 131723
STRAND 93341
HELIX 92360
VARIANT 87787
BINDING 66264
MOD_RES 55836
CHAIN 42318
CONFLICT 35418
MUTAGEN 33909
VAR_SEQ 28741
DISULFID 27471
REPEAT 27354
SIGNAL 24658
TURN 23679
TOPO_DOM 20046
CARBOHYD 18542
COILED 13370
ZN_FING 9644
ACT_SITE 7586
CROSSLNK 7126
MOTIF 4160
SITE 2912
INIT_MET 1984
DNA_BIND 1775
LIPID 1209
CLEAVAGE 842
PROPEP 831
TRANSIT 513
PEPTIDE 425
INTRAMEM 422
BREAKPOINT 181
UNSURE 81
NON_STD 45
GLRA2 3
Name: count, dtype: int64
UniProt feature annotations have been parsed and processed, including domains, PTM sites, variants, and cleavage sites. Individual feature types (SITE, LIPID, CROSSLNK, CARBOHYD, DISULFID, VAR_SEQ, MUTAGEN, CONFLICT, MOD_RES, VARIANT) have been expanded for detailed annotation.
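With start/end coordinates in place, the annotation table can be queried by interval overlap, e.g., to pull every feature touching a peptide's span. A minimal illustration on a toy frame mirroring the uniprot_data columns (features_in_region is a hypothetical helper, not a ProteoForge function):

```python
import pandas as pd

# Toy frame with the same Protein/feature/start/end layout as uniprot_data
anno = pd.DataFrame({
    "Protein": ["P12345", "P12345", "P12345"],
    "feature": ["SIGNAL", "MOD_RES", "DOMAIN"],
    "start":   [1, 30, 100],
    "end":     [21, 30, 180],
})

def features_in_region(anno, protein, start, end):
    """Return features whose [start, end] span overlaps the query region."""
    mask = (
        (anno["Protein"] == protein)
        & (anno["start"] <= end)   # feature begins before the region ends
        & (anno["end"] >= start)   # feature ends after the region begins
    )
    return anno.loc[mask]

# Which annotations touch residues 25-40 of P12345?
hits = features_in_region(anno, "P12345", 25, 40)
print(hits["feature"].tolist())  # ['MOD_RES']
```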
02.3 iPTMNet Database¶
iPTMNet is a comprehensive database of post-translational modifications (PTMs) that integrates data from multiple sources, providing PTM context not available in UniProt alone. Both the PTMs and their confidence scores are downloaded from iPTMNet, parsed, and merged with the UniProt annotations to expand the PTM information.
merged_data = annotate.load_and_process_iptmnet(
ptm_path="./data/input/iPTMnet/ptm.txt",
score_path="./data/input/iPTMnet/score.txt",
verbose=verbose,
)
print(f"Shape of Merged PTM Data: {merged_data.shape}")
print("\nMerged PTM Data Sample:")
print(merged_data.head())
print("\nPTM Types by Occurrence:")
print(merged_data['ptm_type'].value_counts())
--- Loading and Processing ./data/input/iPTMnet/ptm.txt ---
Initial shape: (1055800, 10)
Filtered to Homo sapiens, new shape: (584690, 10)
Shape after cleaning: (584690, 10)
--- Loading and Processing ./data/input/iPTMnet/score.txt ---
Initial shape: (851361, 5)
Shape after cleaning: (851361, 5)
--- Aggregating Scores ---
Aggregated scores into 851,361 unique events.
--- Merging DataFrames and Finalizing Columns ---
Merged scores: 536,531 / 584,690 rows have a non-zero score.
Created 'start' and 'end' columns from 'site'.
Shape of Merged PTM Data: (584690, 13)
Merged PTM Data Sample:
ptm_type source substrate_protein substrate_gene organism site enzyme_protein enzyme_gene note pmid score start end
0 ACETYLATION iedb A8TX70 COL6A5 Homo sapiens (Human) K652 <NA> iedb:1797193 1 652 652
1 ACETYLATION iedb B2RTY4 MYO9A Homo sapiens (Human) K2393 <NA> iedb:1770492 1 2393 2393
2 ACETYLATION iedb O00159 MYO1C Homo sapiens (Human) K396 <NA> iedb:1837008 1 396 396
3 ACETYLATION iedb O14672 ADAM10 Homo sapiens (Human) K659 <NA> iedb:1772798 1 659 659
4 ACETYLATION iedb O15078 CEP290 Homo sapiens (Human) K1350 <NA> iedb:1781939 1 1350 1350
PTM Types by Occurrence:
ptm_type
PHOSPHORYLATION 393181
UBIQUITINATION 98115
ACETYLATION 27754
METHYLATION 22555
O-GLYCOSYLATION 22164
N-GLYCOSYLATION 9875
SUMOYLATION 8417
S-NITROSYLATION 2362
MYRISTOYLATION 138
C-GLYCOSYLATION 115
DIHYDROXYLATION 6
FARNESYLATION 3
CARBOXYLATION 2
S-GLYCOSYLATION 2
N-PHOSPHORYLATION 1
Name: count, dtype: int64
uniprot_data = annotate.add_iptmnet_to_uniprot(uniprot_data, merged_data)
print("\nUpdated UniProt Data with iPTMNet PTM Types:")
print(uniprot_data['feature'].value_counts())
print()
print("The score distribution (3=UniProt and Maybe iPTMNet, highest the better):")
print(uniprot_data['score'].value_counts())
print()
print(uniprot_data.head())
--- Integrating iPTMnet and UniProt Data ---
Added default score of 3 to 1,573,423 UniProt annotations.
Standardized 584,690 iPTMnet annotations.
Concatenated data. New total annotations: 2,158,113
Final DataFrame sorted by Protein, start, and end positions.
Updated UniProt Data with iPTMNet PTM Types:
feature
MOD_RES 501697
DOMAIN 205497
TRANSMEM 174613
NON_TER 167406
COMPBIAS 163351
REGION 131723
CROSSLNK 113658
STRAND 93341
HELIX 92360
VARIANT 87787
BINDING 66264
CARBOHYD 50698
CHAIN 42318
CONFLICT 35418
MUTAGEN 33909
VAR_SEQ 28741
DISULFID 27471
REPEAT 27354
SIGNAL 24658
TURN 23679
TOPO_DOM 20046
COILED 13370
ZN_FING 9644
ACT_SITE 7586
MOTIF 4160
SITE 2912
INIT_MET 1984
DNA_BIND 1775
LIPID 1350
CLEAVAGE 842
PROPEP 831
TRANSIT 513
PEPTIDE 425
INTRAMEM 422
BREAKPOINT 181
UNSURE 81
NON_STD 45
GLRA2 3
Name: count, dtype: int64
The score distribution (3=UniProt and Maybe iPTMNet, highest the better):
score
3 1597927
1 365128
2 115775
0 48159
4 31124
Name: count, dtype: int64
Protein feature start end group agent note description score isoform
0 A0A023HJ61 NON_TER 16 16 3
1 A0A023I7H2 SIGNAL 1 21 3
2 A0A023I7H2 CHAIN 22 603 NADH-ubiquinone oxidoreductase chain 5 3
3 A0A023I7H2 DOMAIN 68 118 NADH-Ubiquinone oxidoreductase (complex I) cha... 3
4 A0A023I7H2 TRANSMEM 84 105 Helical 3
02.4 MEROPS Protease Database¶
MEROPS is a comprehensive database of peptidases (proteases) and their substrates. Loading both family annotations and cleavage-site details enables identification of potential proteolytic processing events, substantially expanding the cleavage and proteolysis annotations available for the peptides observed in the data.
print("--- Processing parsed merops table ---")
merops_data = annotate.process_merops_data(f"./data/input/merops/mer.tab", verbose=verbose)
print()
print("MEROPS Families by Occurrence:")
print(merops_data['agent'].value_counts())
print()
print(f"Shape of MEROPS Data: {merops_data.shape}")
print(merops_data.head())
--- Processing parsed merops table ---
--- Processing MEROPS data from ./data/input/merops/mer.tab ---
Processed 67,781 MEROPS cleavage sites for Homo sapiens.
MEROPS Families by Occurrence:
agent
trypsin 1 20792
glutamyl endopeptidase I 4096
cathepsin S 3098
cathepsin L 2883
matrix metallopeptidase-3 2424
...
IgA1-specific serine peptidase type 2 ({Haemophilus} sp.) 1
hemorrhagic metallopeptidase b ({Trimeresurus mucrosquamatus}) 1
bothrolysin 1
IgA1-specific metallopeptidase 1
membrane Pro-Xaa carboxypeptidase 1
Name: count, Length: 586, dtype: int64
Shape of MEROPS Data: (67781, 10)
Protein feature start end group agent note description score isoform
0 P29401 CLEAVAGE 609 609 Proteolytic Cleavage DESC1 peptidase N-Ph cleavage at None609 by DESC1 peptidase MEROPS Cleavage Site; Protease: DESC1 peptidas... 2 NaN
1 O14950 CLEAVAGE 144 144 Proteolytic Cleavage DESC1 peptidase N-Ph cleavage at None144 by DESC1 peptidase MEROPS Cleavage Site; Protease: DESC1 peptidas... 2 NaN
2 P01011 CLEAVAGE 384 384 Proteolytic Cleavage matrix metallopeptidase-1 Ph cleavage at A384 by matrix metallopeptidase-1 MEROPS Cleavage Site; Protease: matrix metallo... 2 NaN
3 P01009 CLEAVAGE 374 374 Proteolytic Cleavage matrix metallopeptidase-11 Ph cleavage at A374 by matrix metallopeptidase-11 MEROPS Cleavage Site; Protease: matrix metallo... 2 NaN
4 Q15201 CLEAVAGE 1218 1218 Proteolytic Cleavage procollagen C-peptidase Ph cleavage at A1218 by procollagen C-peptidase MEROPS Cleavage Site; Protease: procollagen C-... 2 NaN
uniprot_data = annotate.add_merops_to_uniprot(uniprot_data, merops_data, verbose=verbose)
print("\nUpdated UniProt Data with MEROPS Annotations:")
print(uniprot_data['feature'].value_counts())
print()
print("The score distribution (3=UniProt and Maybe iPTMNet, highest the better):")
print(uniprot_data['score'].value_counts())
print()
print(uniprot_data.head())
--- Adding MEROPS data to UniProt DataFrame ---
Added 67,781 MEROPS annotations. New total: 2,225,894
Final DataFrame sorted by Protein, start, and end positions.
Updated UniProt Data with MEROPS Annotations:
feature
MOD_RES 501697
DOMAIN 205497
TRANSMEM 174613
NON_TER 167406
COMPBIAS 163351
REGION 131723
CROSSLNK 113658
STRAND 93341
HELIX 92360
VARIANT 87787
CLEAVAGE 68623
BINDING 66264
CARBOHYD 50698
CHAIN 42318
CONFLICT 35418
MUTAGEN 33909
VAR_SEQ 28741
DISULFID 27471
REPEAT 27354
SIGNAL 24658
TURN 23679
TOPO_DOM 20046
COILED 13370
ZN_FING 9644
ACT_SITE 7586
MOTIF 4160
SITE 2912
INIT_MET 1984
DNA_BIND 1775
LIPID 1350
PROPEP 831
TRANSIT 513
PEPTIDE 425
INTRAMEM 422
BREAKPOINT 181
UNSURE 81
NON_STD 45
GLRA2 3
Name: count, dtype: Int64
The score distribution (3=UniProt and Maybe iPTMNet, highest the better):
score
3 1597927
1 365128
2 183556
0 48159
4 31124
Name: count, dtype: int64
isoform Protein feature start end group agent note description score
0 A0A023HJ61 NON_TER 16 16 3
1 A0A023I7H2 SIGNAL 1 21 3
2 A0A023I7H2 CHAIN 22 603 NADH-ubiquinone oxidoreductase chain 5 3
3 A0A023I7H2 DOMAIN 68 118 NADH-Ubiquinone oxidoreductase (complex I) cha... 3
4 A0A023I7H2 TRANSMEM 84 105 Helical 3
02.5 Sample Metadata¶
The experimental metadata file contains sample-to-condition mappings, including treatment groups (hypoxia vs. normoxia), timepoints (48hr, 72hr), and replicate information. It is generated from the report.pg_matrix.tsv file by parsing the column names to extract condition and replicate details and by standardizing the sample names for consistency.
# Load the raw metadata file
print("Loading sample metadata (from pg_matrix column names)...")
tmp = pd.read_csv(f'{input_path}report.pg_matrix.tsv', sep='\t')
meta = tmp.columns[4:].to_list()
# Build a meta dataframe
meta = pd.DataFrame(
data=meta,
columns=['filename']
)
# --- Clean and Standardize Metadata ---
print("\n🧹 Cleaning and standardizing metadata columns...")
meta['Run'] = meta['filename'].str.split(r'\\').str[-1].str.replace('.d', '')
# Extract condition identifier
def get_condition(x):
    if '1%' in x:
        return 'True Hypoxia (1% O2)'
    elif 'CoCl' in x:
        return 'Chemical Hypoxia (CoCl2)'
    elif 'H2O2' in x:
        return 'Oxidative Stress (H2O2)'
    else:
        return 'Normoxia (21% O2)'
meta['Condition'] = meta['Run'].map(get_condition)
# Extract the replicate number (special rule for 1% O2)
meta['Replicate'] = meta['Run'].str.split('_').apply(lambda x: x[2] if '1%' in x[1] else x[1][-1])
meta['Duration'] = meta['Run'].str.split('_').str[-5] # fifth element from the end
meta['Colname'] = meta['Run'].str.split('_').map(
# Simplify the colname based on the condition
lambda x: f"{x[1]}_{x[3]}_{x[2]}" if '1%' in x[1] else f"{x[1][:-1]}_{x[2]}_{x[1][-1]}"
)
# --- Finalize DataFrame ---
print("Selecting final columns and renaming for clarity...")
print("\n✅ Metadata processing complete.")
print(f"Final DataFrame shape: {meta.shape}")
print(f"Final columns: {meta.columns.tolist()}")
print()
# Sort the metadata DataFrame for better readability
meta = meta.sort_values(
['Condition', 'Duration', 'Replicate'],
ascending=[False, True, True]
).reset_index(drop=True)
# Keep only the relevant conditions for the study
relevant_conditions = [
'True Hypoxia (1% O2)',
'Normoxia (21% O2)',
]
print(f"Filtering metadata to keep only relevant conditions: {relevant_conditions}")
meta = meta[meta['Condition'].isin(relevant_conditions)].reset_index(drop=True)
print(f"Filtered DataFrame shape: {meta.shape}")
meta
Loading sample metadata (from pg_matrix column names)...
🧹 Cleaning and standardizing metadata columns...
Selecting final columns and renaming for clarity...
✅ Metadata processing complete.
Final DataFrame shape: (32, 6)
Final columns: ['filename', 'Run', 'Condition', 'Replicate', 'Duration', 'Colname']
Filtering metadata to keep only relevant conditions: ['True Hypoxia (1% O2)', 'Normoxia (21% O2)']
Filtered DataFrame shape: (16, 6)
| filename | Run | Condition | Replicate | Duration | Colname | |
|---|---|---|---|---|---|---|
| 0 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_1_48_... | H358_1%_1_48_DIA_BB4_1_5432 | True Hypoxia (1% O2) | 1 | 48 | 1%_48_1 |
| 1 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_2_48_... | H358_1%_2_48_DIA_BC4_1_5440 | True Hypoxia (1% O2) | 2 | 48 | 1%_48_2 |
| 2 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_3_48_... | H358_1%_3_48_DIA_BD4_1_5448 | True Hypoxia (1% O2) | 3 | 48 | 1%_48_3 |
| 3 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_4_48_... | H358_1%_4_48_DIA_BE4_1_5456 | True Hypoxia (1% O2) | 4 | 48 | 1%_48_4 |
| 4 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_1_72_... | H358_1%_1_72_DIA_BB8_1_5436 | True Hypoxia (1% O2) | 1 | 72 | 1%_72_1 |
| 5 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_2_72_... | H358_1%_2_72_DIA_BC8_1_5444 | True Hypoxia (1% O2) | 2 | 72 | 1%_72_2 |
| 6 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_3_72_... | H358_1%_3_72_DIA_BD8_1_5452 | True Hypoxia (1% O2) | 3 | 72 | 1%_72_3 |
| 7 | D:\Data\Tam\RevisionLungDIA\H358\H358_1%_4_72_... | H358_1%_4_72_DIA_BE8_1_5460 | True Hypoxia (1% O2) | 4 | 72 | 1%_72_4 |
| 8 | D:\Data\Tam\RevisionLungDIA\H358\H358_C1_48_DI... | H358_C1_48_DIA_BB1_1_5429 | Normoxia (21% O2) | 1 | 48 | C_48_1 |
| 9 | D:\Data\Tam\RevisionLungDIA\H358\H358_C2_48_DI... | H358_C2_48_DIA_BC1_1_5437 | Normoxia (21% O2) | 2 | 48 | C_48_2 |
| 10 | D:\Data\Tam\RevisionLungDIA\H358\H358_C3_48_DI... | H358_C3_48_DIA_BD1_1_5445 | Normoxia (21% O2) | 3 | 48 | C_48_3 |
| 11 | D:\Data\Tam\RevisionLungDIA\H358\H358_C4_48_DI... | H358_C4_48_DIA_BE1_1_5453 | Normoxia (21% O2) | 4 | 48 | C_48_4 |
| 12 | D:\Data\Tam\RevisionLungDIA\H358\H358_C1_72_DI... | H358_C1_72_DIA_BB5_1_5433 | Normoxia (21% O2) | 1 | 72 | C_72_1 |
| 13 | D:\Data\Tam\RevisionLungDIA\H358\H358_C2_72_DI... | H358_C2_72_DIA_BC5_1_5441 | Normoxia (21% O2) | 2 | 72 | C_72_2 |
| 14 | D:\Data\Tam\RevisionLungDIA\H358\H358_C3_72_DI... | H358_C3_72_DIA_BD5_1_5449 | Normoxia (21% O2) | 3 | 72 | C_72_3 |
| 15 | D:\Data\Tam\RevisionLungDIA\H358\H358_C4_72_DI... | H358_C4_72_DIA_BE5_1_5457 | Normoxia (21% O2) | 4 | 72 | C_72_4 |
The metadata defines the experimental design: 16 samples across two conditions (True Hypoxia and Normoxia) at two timepoints (48hr and 72hr), with 4 biological replicates per condition-timepoint combination. The columns filename, Run, Condition, Duration, Replicate, and Colname are created in the metadata; Colname is a shortened, standardized version of the sample name that will be used throughout the analysis.
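The Colname construction above can be condensed into a small standalone function. This is a sketch of the lambda used in the metadata cell (make_colname is a hypothetical name), assuming the same underscore-delimited Run naming seen in the table:

```python
def make_colname(run):
    """Derive the short sample name from a DIA-NN Run name.

    Hypoxia runs look like 'H358_1%_<rep>_<dur>_...';
    control runs look like 'H358_C<rep>_<dur>_...'.
    """
    x = run.split("_")
    if "1%" in x[1]:
        return f"{x[1]}_{x[3]}_{x[2]}"        # condition_duration_replicate
    return f"{x[1][:-1]}_{x[2]}_{x[1][-1]}"   # move trailing replicate digit to the end

print(make_colname("H358_1%_1_48_DIA_BB4_1_5432"))  # 1%_48_1
print(make_colname("H358_C1_48_DIA_BB1_1_5429"))    # C_48_1
```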
02.6 Quantitative Proteomics Data¶
The quantitative data from DIA-NN output contains peptide-level intensity values across all samples. The report.parquet file is loaded to obtain peptide sequences, protein accessions, and intensity measurements for each of the 16 samples. The report contains precursor-level data that will be aggregated to the peptide level and mapped to proteins.
print("Loading precursors long data...")
# Load and preprocess hypoxia experiment data efficiently
data = pd.read_parquet(f'{input_path}report.parquet')
print(f"Data shape (long format): {data.shape}")
print(f"Columns: {data.columns.tolist()}")
print("\n--- 📊 Data Preparation: Pivoting and Cleaning ---")
print("Pivoting the data to wide format (use sum of multiple precursors)...")
print(" - Using normalized precursor intensities for quantification.")
print(" - Doesn't have modified sequences using stripped sequences as peptide identifiers.")
print(" - Protein IDs may contain isoforms or multiple entries, which will be handled later.")
data_wide = (
data.pivot_table(
index=['Protein.Ids', 'Stripped.Sequence'],
columns='Run',
values='Precursor.Normalised',
aggfunc='sum'
)
.reset_index()
.rename(columns={
'Protein.Ids': 'Protein',
'Stripped.Sequence': 'Peptide'
})
)
# Display summary information
print(f"Data shape (wide format): {data_wide.shape}")
print(f"Columns: {data_wide.columns.tolist()}")
print("\n--- 🧬 Protein ID Representation ---")
# Check if the protein contains '-' indicating it has isoforms
has_isoform = data_wide['Protein'].str.contains('-').any()
has_multiples = data_wide['Protein'].str.contains(';').any()
# Friendly print statements to explain the protein ID representation
if has_isoform:
    print("The dataset contains protein IDs with isoforms (e.g., 'P12345-2'). These represent specific variants of a protein.")
else:
    print("The dataset does not contain protein IDs with isoforms.")
if has_multiples:
    print("The dataset contains protein IDs with multiple entries (e.g., 'P12345; P67890'). These represent proteins that may have multiple forms or related proteins.")
    print()
    print("Will explode these entries to create a more detailed DataFrame.")
    data_wide = process.explode_aligned(data_wide, ['Protein'], sep=';', verbose=verbose)
    print()
    print(f"After exploding, the dataset now has {len(data_wide)} rows and {len(data_wide.columns)} columns.")
else:
    print("The dataset does not contain protein IDs with multiple entries.")
print(f"Values below 2 should be set as NaN, they might be artifacts of quantification.")
print(f"Number of missing values in the DataFrame before cleanup: {data_wide.isnull().sum().sum()}")
# Identify numeric columns in the DataFrame
numeric_columns = data_wide.select_dtypes(include=[np.number]).columns
data_wide[numeric_columns] = data_wide[numeric_columns].mask(
data_wide[numeric_columns] < 2
)
print(f"Number of missing values in the DataFrame after cleanup: {data_wide.isnull().sum().sum()}")
data_wide = data_wide.set_index(['Protein', 'Peptide'])[meta['Run']]
print("Set cleaned column names from metadata")
data_wide.columns = meta['Colname']
data_wide = data_wide.sort_index()
data_wide.head()
Loading precursors long data...
Data shape (long format): (2988832, 67)
Columns: ['Run.Index', 'Run', 'Channel', 'Precursor.Id', 'Precursor.Channels.Group', 'Modified.Sequence', 'Stripped.Sequence', 'Precursor.Charge', 'Precursor.Lib.Index', 'Proteotypic', 'Precursor.Mz', 'Protein.Ids', 'Protein.Group', 'Protein.Names', 'Genes', 'RT', 'iRT', 'Predicted.RT', 'Predicted.iRT', 'IM', 'iIM', 'Predicted.IM', 'Predicted.iIM', 'Precursor.Quantity', 'Precursor.Normalised', 'Ms1.Area', 'Ms1.Normalised', 'Ms1.Apex.Area', 'Normalisation.Factor', 'Quantity.Quality', 'Empirical.Quality', 'Normalisation.Noise', 'Ms1.Profile.Corr', 'Averagine', 'Evidence', 'Mass.Evidence', 'Ms1.Total.Signal.Before', 'Ms1.Total.Signal.After', 'RT.Start', 'RT.Stop', 'FWHM', 'PG.Normalised', 'PG.MaxLFQ', 'Genes.Normalised', 'Genes.MaxLFQ', 'Genes.MaxLFQ.Unique', 'PG.MaxLFQ.Quality', 'Q.Value', 'PEP', 'Global.Q.Value', 'Lib.Q.Value', 'Peptidoform.Q.Value', 'Global.Peptidoform.Q.Value', 'Lib.Peptidoform.Q.Value', 'PTM.Q.Value', 'PTM.Site.Confidence', 'Site.Occupancy.Probabilities', 'Protein.Sites', 'Lib.PTM.Site.Confidence', 'Translated.Q.Value', 'Channel.Q.Value', 'PG.Q.Value', 'PG.PEP', 'GG.Q.Value', 'Protein.Q.Value', 'Global.PG.Q.Value', 'Lib.PG.Q.Value']
--- 📊 Data Preparation: Pivoting and Cleaning ---
Pivoting the data to wide format (use sum of multiple precursors)...
 - Using normalized precursor intensities for quantification.
 - Doesn't have modified sequences using stripped sequences as peptide identifiers.
 - Protein IDs may contain isoforms or multiple entries, which will be handled later.
Data shape (wide format): (104861, 34)
Columns: ['Protein', 'Peptide', 'H358_1%_1_48_DIA_BB4_1_5432', 'H358_1%_1_72_DIA_BB8_1_5436', 'H358_1%_2_48_DIA_BC4_1_5440', 'H358_1%_2_72_DIA_BC8_1_5444', 'H358_1%_3_48_DIA_BD4_1_5448', 'H358_1%_3_72_DIA_BD8_1_5452', 'H358_1%_4_48_DIA_BE4_1_5456', 'H358_1%_4_72_DIA_BE8_1_5460', 'H358_C1_48_DIA_BB1_1_5429', 'H358_C1_72_DIA_BB5_1_5433', 'H358_C2_48_DIA_BC1_1_5437', 'H358_C2_72_DIA_BC5_1_5441', 'H358_C3_48_DIA_BD1_1_5445', 'H358_C3_72_DIA_BD5_1_5449', 'H358_C4_48_DIA_BE1_1_5453', 'H358_C4_72_DIA_BE5_1_5457', 'H358_CoCl1_48_DIA_BB2_1_5430', 'H358_CoCl1_72_DIA_BB6_1_5434', 'H358_CoCl2_48_DIA_BC2_1_5438', 'H358_CoCl2_72_DIA_BC6_1_5442', 'H358_CoCl3_48_DIA_BD2_1_5446', 'H358_CoCl3_72_DIA_BD6_1_5450', 'H358_CoCl4_48_DIA_BE2_1_5454', 'H358_CoCl4_72_DIA_BE6_1_5458', 'H358_H2O21_48_DIA_BB3_1_5431', 'H358_H2O21_72_DIA_BB7_1_5435', 'H358_H2O22_48_DIA_BC3_1_5439', 'H358_H2O22_72_DIA_BC7_1_5443', 'H358_H2O23_48_DIA_BD3_1_5447', 'H358_H2O23_72_DIA_BD7_1_5451', 'H358_H2O24_48_DIA_BE3_1_5455', 'H358_H2O24_72_DIA_BE7_1_5459']
--- 🧬 Protein ID Representation ---
The dataset does not contain protein IDs with isoforms.
The dataset contains protein IDs with multiple entries (e.g., 'P12345; P67890'). These represent proteins that may have multiple forms or related proteins.
Will explode these entries to create a more detailed DataFrame.
🚀 Starting aligned explosion for columns: ['Protein']
Original DataFrame shape: (104861, 34)
Final DataFrame shape: (112248, 34)
✅ Transformation complete in 1.0261 seconds.
After exploding, the dataset now has 112248 rows and 34 columns.
Values below 2 should be set as NaN, they might be artifacts of quantification.
Number of missing values in the DataFrame before cleanup: 610106
Number of missing values in the DataFrame after cleanup: 610106
Set cleaned column names from metadata
| Colname | 1%_48_1 | 1%_48_2 | 1%_48_3 | 1%_48_4 | 1%_72_1 | 1%_72_2 | 1%_72_3 | 1%_72_4 | C_48_1 | C_48_2 | C_48_3 | C_48_4 | C_72_1 | C_72_2 | C_72_3 | C_72_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Protein | Peptide | ||||||||||||||||
| A0A024B7W1 | LAILMGATFAEMNTGGDVAHLALIAAFK | 2394.7788 | 2653.8262 | 4374.3252 | 4618.3652 | NaN | 3560.3354 | 5652.7100 | NaN | 3485.5161 | 2289.0959 | 4028.9673 | 4263.6060 | NaN | NaN | 3707.3254 | NaN |
| A0A024RBG1 | EVYEEAGVK | 56351.1875 | 63097.6523 | 49170.1328 | 55968.3789 | NaN | 41009.9844 | 51193.7461 | 54011.7969 | 62612.5977 | 63970.3516 | 37216.3477 | 44602.7344 | 58097.2539 | 73113.1484 | 65346.1016 | 56678.7344 |
| EVYEEAGVKGK | NaN | NaN | 4063.8176 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | |
| EWFKVEDAIK | 1051.2585 | NaN | 4486.0947 | 5052.0957 | NaN | 7127.1855 | 6379.5830 | 3082.0085 | NaN | 2105.7913 | 4573.0874 | 5685.3628 | 2177.9431 | NaN | 6287.4355 | 6404.9004 | |
| LLGIFEQNQDR | 16599.4629 | 13334.4561 | 12436.4561 | 16408.7402 | NaN | 16807.1602 | 13422.2295 | 11881.2012 | 14877.3135 | 11963.4307 | 13533.5078 | 18642.8730 | 16161.4492 | 18011.0938 | 16386.9121 | 13474.7529 |
Findings: The DIA-NN quantitative data has been loaded and pivoted to wide format with:
- Peptide sequences (stripped) as row identifiers
- Sample intensities as columns (normalized precursor values)
- Multi-protein assignments exploded to individual rows
- Low-value artifacts (<2) set to missing
All data tables are now ready for the next steps: sequence mapping and quality control.
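The pivot-and-mask step described above can be sketched in a few lines. This is a minimal illustration with hypothetical column names and intensities, not the actual DIA-NN report schema or the parsing code used in this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical long-format report: one row per (peptide, sample) measurement
long_df = pd.DataFrame({
    "Protein":   ["P1", "P1", "P2", "P2"],
    "Peptide":   ["AAAK", "AAAK", "CCCR", "CCCR"],
    "Sample":    ["s1", "s2", "s1", "s2"],
    "Intensity": [1500.0, 1.5, 2300.0, 0.8],
})

# Pivot to wide format: peptides as rows, samples as columns
wide = long_df.pivot_table(
    index=["Protein", "Peptide"], columns="Sample",
    values="Intensity", aggfunc="first",
)

# Mask near-zero values that may be quantification artifacts
wide = wide.mask(wide < 2)
```

After masking, the sub-threshold intensities become NaN and are treated as missing in all downstream steps.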
03. Sequence Mapping¶
This section maps detected peptide sequences to their parent protein sequences to establish positional information and calculate sequence coverage metrics.
# Build a dictionary to match condition_colors keys: 'Condition - Duration Hr'
condition_to_samples = {}
for condition in meta['Condition'].unique():
for duration in sorted(meta.loc[meta['Condition'] == condition, 'Duration'].unique()):
key = f"{condition} - {duration} Hr"
samples = meta.loc[(meta['Condition'] == condition) & (meta['Duration'] == duration), 'Colname'].tolist()
condition_to_samples[key] = samples
# samples_to_condition mapping
samples_to_condition = {sample: condition for condition, samples in condition_to_samples.items() for sample in samples}
# samples_to_colors mapping
samples_to_colors = {sample: condition_colors[condition] for sample, condition in samples_to_condition.items()}
03.1 Protein Overlap Analysis¶
First, all identified proteins are compared against the FASTA reference to confirm they exist in the database and to retrieve their sequences for peptide mapping. Although not strictly necessary, we also examine the intersection and differences between the identified proteins and the reference proteome (from the UniProt FASTA).
# Extract protein information and reset indices
info_data = data_wide.reset_index()[['Protein', 'Peptide']]
data_wide = data_wide.reset_index(drop=True)
print(f"✓ Info data extracted: {info_data.shape[0]:,} peptides")
print()
# =============================================================================
# PROTEIN REFERENCE COMPARISON
# =============================================================================
print("🧬 PROTEIN OVERLAP ANALYSIS")
print("=" * 50)
# Calculate protein sets and overlap
fasta_set = set(fasta_data['Protein'].tolist())
identified_set = set(info_data['Protein'].tolist())
overlap = len(fasta_set & identified_set)
only_reference = len(fasta_set - identified_set)
only_identified = len(identified_set - fasta_set)
total_reference = len(fasta_set)
# Create summary statistics
print(f"📋 PROTEIN INVENTORY:")
print(f" • Reference proteins: {total_reference:,}")
print(f" • Identified proteins: {len(identified_set):,}")
print(f" • Overlap: {overlap:,} ({overlap/total_reference*100:.1f}% coverage)")
print(f" • Unmatched: {only_identified:,}")
print()
# Create clean horizontal bar chart
fig, ax = plt.subplots(1, 1, figsize=(10, 4))
categories = ['Reference Only', 'Overlap\n(Identified)', 'Dataset Only\n(Unmatched)']
values = [only_reference, overlap, only_identified]
colors = ['#c3c3c3', '#139593', '#e54f2a']
bars = ax.barh(categories, values, color=colors, alpha=0.8)
# Add value labels on bars
for i, (bar, value) in enumerate(zip(bars, values)):
width = bar.get_width()
percentage = (value / total_reference * 100) if i != 2 else (value / len(identified_set) * 100)
ax.text(width + max(values)*0.01, bar.get_y() + bar.get_height()/2,
f'{value:,}\n({percentage:.1f}%)',
ha='left', va='center', fontweight='bold')
ax.set_xlabel('Number of Proteins', fontweight='bold')
ax.set_title('Protein Dataset Coverage Analysis', fontweight='bold', pad=15)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plots.finalize_plot(
fig, show=True, save=save_to_folder,
filename="fasta_overlap",
# Arguments
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi
)
# Handle unmatched proteins
if only_identified > 0:
unmatched_proteins = identified_set - fasta_set
print("⚠️ UNMATCHED PROTEINS ANALYSIS")
print("-" * 35)
print(f"• {only_identified:,} proteins not in reference (may be isoforms, contaminants, or outdated IDs)")
if only_identified <= 5:
print("• Complete list:")
for protein in sorted(unmatched_proteins):
print(f" - {protein}")
else:
print(f"• Sample (showing 5 of {only_identified:,}):")
for protein in sorted(list(unmatched_proteins))[:5]:
print(f" - {protein}")
# =============================================================================
# DATA CLEANUP
# =============================================================================
print("🧹 CLEANING DATA - REMOVING UNMATCHED PROTEINS")
print("-" * 50)
print(f"• Removing {only_identified:,} unmatched proteins from dataset")
print(f"• Before cleanup: {info_data.shape[0]:,} peptides")
# Remove unmatched proteins from info_data
info_data = info_data[~info_data["Protein"].isin(unmatched_proteins)]
print(f"• After cleanup: {info_data.shape[0]:,} peptides")
# Update main data to match filtered info_data
data_wide = data_wide.loc[info_data.index].reset_index(drop=True)
info_data = info_data.reset_index(drop=True)
# Verify data integrity
if info_data.shape[0] == data_wide.shape[0]:
print("✓ Data shapes match after cleanup")
else:
print("❌ Warning: Data shape mismatch after cleanup")
else:
print("✅ Perfect coverage: All identified proteins found in reference")
✓ Info data extracted: 112,248 peptides 🧬 PROTEIN OVERLAP ANALYSIS ================================================== 📋 PROTEIN INVENTORY: • Reference proteins: 26,772 • Identified proteins: 11,042 • Overlap: 11,035 (41.2% coverage) • Unmatched: 7
⚠️ UNMATCHED PROTEINS ANALYSIS ----------------------------------- • 7 proteins not in reference (may be isoforms, contaminants, or outdated IDs) • Sample (showing 5 of 7): - O90368 - P05878 - P27283 - Q306W8 - Q5XXP4 🧹 CLEANING DATA - REMOVING UNMATCHED PROTEINS -------------------------------------------------- • Removing 7 unmatched proteins from dataset • Before cleanup: 112,248 peptides • After cleanup: 112,240 peptides ✓ Data shapes match after cleanup
Seven identified proteins were not found in the reference and have been removed to ensure valid peptide-to-sequence mapping. The bar chart visualizes the overlap between identified and reference proteins.
03.2 Position Mapping¶
The sequence position mapping locates each peptide within its parent protein sequence. This establishes the start and end positions required for coverage visualization and proteoform analysis.
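The core of position mapping can be sketched as a simple scan that returns every occurrence of a peptide within its parent sequence. This is an illustration of the idea only, not the ProteoForge implementation; positions are reported 1-based and inclusive, matching the peptide_start/peptide_end convention used below:

```python
def peptide_positions(protein_seq: str, peptide: str):
    """Return 1-based (start, end) positions of every occurrence of a
    peptide within a protein sequence, including overlapping matches."""
    positions = []
    start = protein_seq.find(peptide)
    while start != -1:
        positions.append((start + 1, start + len(peptide)))
        # Advance by one so overlapping occurrences are also found
        start = protein_seq.find(peptide, start + 1)
    return positions
```

A peptide found more than once yields multiple position tuples, which is exactly why the expansion step below can produce more rows than input peptides.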
# =============================================================================
# MERGE FASTA DATA WITH INFO DATA
# =============================================================================
print("🔗 MERGING FASTA DATA WITH INFO DATA")
print("=" * 50)
info_data = info_data.merge(
fasta_data[['Protein', 'Gene', 'Description', 'Length', 'Weight(kDa)', 'Sequence']],
on='Protein',
how='left',
validate='m:1'
)
print(f"✓ Merged info data: {info_data.shape[0]:,} peptides with protein details")
print()
# =============================================================================
# MAP PEPTIDE POSITIONS TO PROTEIN SEQUENCES
# =============================================================================
new_info_data = process.map_peptide_to_protein_positions(
data=info_data,
sequence_col='Sequence',
peptide_col='Peptide',
protein_col='Protein',
# modified_col='Modified Peptide',
# Outputs
peptide_start_col='peptide_start',
peptide_end_col='peptide_end',
unique_id_col='unique_id',
total_occurrences_col='total_occurrences',
occurrence_index_col='occurrence_index',
original_index_col='original_index',
# Processing Options
expand_multiple=True,
add_unique_id=True,
add_occurrence_info=True,
position_ordered=True,
verbose=verbose
)
print()
print(f'Original Info Data shape: {info_data.shape}')
print(f'Original Quant Data shape: {data_wide.shape}')
print()
print("Reindexing the quantitative data to match the new info data...")
info_data, data_wide = process.reindex_quantitative_data(
info_data=new_info_data,
quan_data=data_wide,
original_index_col='original_index',
remove_unmatched=True,
position_col='peptide_start',
verbose=verbose
)
print()
print(f"Final Info Data shape: {info_data.shape}")
print(f"Final Quant Data shape: {data_wide.shape}")
🔗 MERGING FASTA DATA WITH INFO DATA
==================================================
✓ Merged info data: 112,240 peptides with protein details
============================================================
PEPTIDE-TO-PROTEIN POSITION MAPPING
============================================================
Input dataset: 112,240 rows
Detected columns:
- Protein sequence: 'Sequence'
- Peptide sequence: 'Peptide'
- Protein ID: 'Protein'
Processing options:
- Expand multiple occurrences: True
- Add unique IDs: True
- Add occurrence info: True
- Position ordered: True
- Preserve original indices: True
Finding peptide positions...
POSITION FINDING RESULTS:
- Total peptide entries: 112,240
- Successfully matched: 112,240 (100.000%)
- Unmatched peptides: 0 (0.000%)
OCCURRENCE DETAILS:
- Single occurrences: 111,800
- Multiple occurrences: 440
- Total positions found: 113,512
- Max occurrences per peptide: 22
- Average occurrences (matched): 1.01
EXPANSION INFO:
- Expected output rows (with expansion): 113,512
→ 113,512 positioned rows + 0 unmatched rows
- Row expansion factor: 1.01x
Expanding multiple occurrences...
Adding unique proteoform identifiers...
FINAL PROCESSING SUMMARY:
- Input rows: 112,240
- Output rows: 113,512
POSITIONING RESULTS (EXPANDED MODE):
- Original peptides matched: 112,240 (100.000%)
- Original peptides unmatched: 0 (0.000%)
- Total positioned peptides (expanded): 113,512
- Total unpositioned rows: 0
- Expected positioned peptides: 113,512
- Expected unpositioned rows: 0
- Expected total output: 113,512
- Row expansion factor: 1.01x
INSIGHTS:
- Total unique positions mapped: 113,512
- Average positions per mapped peptide: 1.01
- Peptides with >5 occurrences: 77 (max: 22)
============================================================
Ordering peptides by positions using columns: ['Protein', 'peptide_start', 'peptide_end']
Original Info Data shape: (112240, 7)
Original Quant Data shape: (112240, 16)
Reindexing the quantitative data to match the new info data...
============================================================
REINDEXING AND FILTERING QUANTITATIVE DATA
============================================================
Info data shape: (113512, 13)
Quantitative data shape: (112240, 16)
Remove unmatched peptides: True
Position column: 'peptide_start'
Filtered info data shape: (113512, 13)
Original index column: 'original_index'
Original indices range: 0 to 112239
REINDEXING RESULTS:
- Total output rows: 113,512
- Matched rows (copied quantitative data): 113,512
- Remaining unmatched rows (NaN values): 0
- Output shape: (113512, 16)
- Data fill rate: 83.4% (1,514,557/1,816,192 values)
- Row expansion factor: 1.01x
- Max duplications per original row: 22
============================================================
Final Info Data shape: (113512, 12)
Final Quant Data shape: (113512, 16)
The info_data is expanded with peptide positions (start, end). Peptides that occur multiple times within a protein are expanded into separate rows with unique identifiers. While this can inflate the peptide count for repetitive sequences (e.g., repeat-containing proteins), it ensures every occurrence is represented rather than the peptide being arbitrarily mapped to a single position. With this expansion, the quantitative data has been reindexed to match the positional information.
04. Data Cleaning¶
This section applies quality control filters and normalization procedures to prepare the data for statistical analysis. The cleaning workflow ensures robust and reproducible downstream results. Additionally, visualizations are generated to assess data quality at each step.
04.1 Overlapping Peptide Grouping¶
There can be many reasons why highly overlapping peptides are identified for the same protein. While the exact cause is often hard to pin down, we group highly overlapping peptides together to reduce complexity and redundancy in the data. This simplifies the dataset and focuses the downstream analysis on unique peptide signals. The maximum allowed difference for overlap detection is set to 3 amino acids, so peptides whose boundaries fall within 3 amino acids of each other are grouped together.
(Note: this does not mean two peptides overlap by 3 amino acids; it means the difference between their start positions, and between their end positions, is at most 3 amino acids.)
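A simplified greedy sketch of this boundary criterion, for illustration only (pg.process_peptide_overlaps has its own algorithm and also aggregates the corresponding quantitative rows):

```python
def group_overlapping(peptides, max_diff=3):
    """Greedy sketch: position-sorted (start, end) peptides are merged into
    the current group when BOTH boundary shifts relative to the previous
    peptide are <= max_diff amino acids."""
    groups, current = [], []
    for start, end in sorted(peptides):
        if current and abs(start - current[-1][0]) <= max_diff \
                   and abs(end - current[-1][1]) <= max_diff:
            current.append((start, end))
        else:
            if current:
                groups.append(current)
            current = [(start, end)]
    if current:
        groups.append(current)
    return groups
```

For example, peptides spanning (10, 20) and (11, 22) fall into one group (boundary shifts of 1 and 2), while (50, 60) starts its own group.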
from ProteoForge import peptide_grouper as pg
prv_unq_peps = info_data.shape[0]
prv_unq_prts = info_data['Protein'].nunique()
info_data, data_wide = pg.process_peptide_overlaps(
info_df=info_data,
quan_df=data_wide,
protein_col='Protein',
startpos_col='peptide_start',
endpos_col='peptide_end',
max_diff=3,
verbose=verbose
)
# Check if the indices match
if info_data.shape[0] == data_wide.shape[0]:
print("✓ Data shapes match after peptide grouping")
if info_data.index.equals(data_wide.index):
print("✓ Data indices contents and orders match after peptide grouping")
# Reset indices for consistency
info_data = info_data.reset_index(drop=True)
data_wide = data_wide.reset_index(drop=True)
upd_unq_peps = info_data.shape[0]
upd_unq_prts = info_data['Protein'].nunique()
print()
print(f"Peptide grouping reduced peptides from {prv_unq_peps:,} to {upd_unq_peps:,} ({(1 - upd_unq_peps/prv_unq_peps)*100:.1f}% reduction)")
print(f"Number of proteins remained the same: {prv_unq_prts:,} proteins")
print()
Processing 113512 peptides across 11035 proteins Converting object columns to numeric... ✅ Completed in 12.0s: 102179 representative peptides from 102179 groups ✓ Data shapes match after peptide grouping ✓ Data indices contents and orders match after peptide grouping Peptide grouping reduced peptides from 113,512 to 102,179 (10.0% reduction) Number of proteins remained the same: 11,035 proteins
Overlapping peptides (boundaries within 3 positions) have been merged to reduce redundancy. This consolidation reduces the peptide count while maintaining protein coverage, ensuring cleaner downstream proteoform analysis. Both info_data and data_wide (the quantitative data) have been updated to reflect these groupings.
04.2 Protein Coverage Calculation¶
Protein coverage quantifies the proportion of each protein sequence covered by detected peptides. The coverage.calculate_protein_coverage() function computes coverage percentages and trace patterns for quality assessment. The calculation accounts for overlapping peptides so that shared residues are counted only once, avoiding overestimation of coverage.
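The overlap-aware coverage idea can be illustrated with a small interval-merging sketch. This is an assumption of how such a calculation typically works, not the internals of coverage.calculate_protein_coverage():

```python
def protein_coverage(length, peptide_spans):
    """Percent of residues covered by at least one peptide, counting
    residues shared by overlapping spans only once.

    Spans are 1-based inclusive (start, end) tuples."""
    covered, last_end = 0, 0
    for start, end in sorted(peptide_spans):
        start = max(start, last_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / length
```

For a 100-residue protein with peptides at (1, 10), (5, 20), and (40, 50), the overlapping first two spans contribute residues 1-20 once, giving 31% rather than the naive 37%.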
print("📏 CALCULATING PROTEIN COVERAGE & TRACE")
info_data = coverage.calculate_protein_coverage(
info_data,
protein_col='Protein',
peptide_col='Peptide',
seq_len_col='Length',
start_pos_col='peptide_start',
end_pos_col='peptide_end',
unique_id_col='unique_id',
gap=0,
validate_input=True,
n_jobs='auto',
verbose=verbose
)
📏 CALCULATING PROTEIN COVERAGE & TRACE Processing 11035 proteins with 102179 total peptides Auto-selected: multiprocessing with 8 processes Starting multiprocessing coverage calculation... Multiprocessing completed in 3.58s Applying results to DataFrame... Building trace mappings (large dataset)... Finalizing results... Average protein coverage: 20.6% Generated DataFrame with 102179 rows and 15 columns
Protein sequence coverage has been calculated for all proteins (Avg 20.6%). Coverage percentages (Cov%) indicate the fraction of each protein's amino acid sequence represented by detected peptides. This information is added to info_data for downstream filtering and analysis.
Proteomic Coverage Visualization¶
The dataset-wide protein coverage plots visualize the relationship between protein length, peptide count, and sequence coverage. This high-level overview puts into perspective how well proteins are represented in the dataset. It is not crucial, but it is a nice-to-have visualization for understanding the dataset better.
# 1. Aggregate and transform the data
# This step remains the same as your original code
plot_data = info_data.groupby("Protein").agg(
Length=("Length", "first"),
PeptideCount=("Peptide", "count"),
CovPerc=("Cov%", "first")
).reset_index()
plot_data["Length"] = np.log10(plot_data["Length"])
plot_data["PeptideCount"] = np.log10(plot_data["PeptideCount"])
# 2. Set up the figure and subplots
# Create a figure with 1 row and 2 columns of subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5.5), constrained_layout=True)
fig.suptitle(
"Proteomic Quality Assessment: Linking Protein Length with Peptide Count and Sequence Coverage",
fontsize=16,
fontweight="bold"
)
# --- Subplot A: Protein Length vs. Peptide Count ---
ax1 = axes[0]
sns.scatterplot(
ax=ax1,
data=plot_data,
x="Length",
y="PeptideCount",
hue="CovPerc",
size="CovPerc",
palette="viridis_r", # Reversed Viridis is great for this
sizes=(10, 200),
alpha=0.7,
edgecolor="black",
linewidth=0.4,
)
ax1.set_title("Length vs. Peptide Count by Coverage", fontsize=12, style="italic")
ax1.set_xlabel("$Log_{10}$(Protein Length)", fontsize=11)
ax1.set_ylabel("$Log_{10}$(Peptide Count)", fontsize=11)
ax1.grid(True, which="both", linestyle="--", linewidth=0.5, alpha=0.7)
ax1.legend(
title="Coverage %", loc="upper left", frameon=False,
facecolor="#FFFFFFCC", edgecolor="black",
fontsize=9, title_fontsize=10
)
ax1.text(-0.1, 1.05, 'A', transform=ax1.transAxes, size=18, weight='bold') # Subplot label
# --- Subplot B: Protein Coverage Distribution ---
ax2 = axes[1]
# Use a complementary color from the start of the viridis palette
hist_color = "#440154"
sns.histplot(
ax=ax2,
data=plot_data,
x="CovPerc",
bins=15,
kde=True,
color=hist_color,
alpha=0.75,
edgecolor="white",
linewidth=0.8,
)
ax2.set_title("Protein Coverage Distribution", fontsize=12, style="italic")
ax2.set_xlabel("Sequence Coverage (%)", fontsize=11)
ax2.set_ylabel("Protein Frequency", fontsize=11)
ax2.grid(True, which="both", linestyle="--", linewidth=0.5, alpha=0.7)
ax2.text(-0.1, 1.05, 'B', transform=ax2.transAxes, size=18, weight='bold') # Subplot label
# 3. Final adjustments
# Despine removes the top and right spines for a cleaner look
sns.despine(fig=fig, top=True, right=True, left=True, bottom=False)
plots.finalize_plot(
fig, show=True, save=save_to_folder,
filename='proteinLength_peptideCount_coverage',
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi
)
# Print the number of proteins with higher than 75% coverage
high_coverage_count = plot_data[plot_data['CovPerc'] > 75].shape[0]
print(f"Number of proteins with coverage > 75%: {high_coverage_count}")
print(f"Number of proteins with coverage > 90%: {plot_data[plot_data['CovPerc'] > 90].shape[0]}")
high_coverage_proteins = plot_data[plot_data['CovPerc'] > 90].sort_values(by='CovPerc', ascending=False)['Protein'].tolist()
print()
for protein in high_coverage_proteins:
gene = info_data.loc[info_data['Protein'] == protein, 'Gene'].iloc[0]
description = info_data.loc[info_data['Protein'] == protein, 'Description'].iloc[0]
prtLength = info_data.loc[info_data['Protein'] == protein, 'Length'].iloc[0]
print(f"{protein:<10} ({gene:<5}) - {description:<50} - Length: {prtLength:>4} aa")
Number of proteins with coverage > 75%: 33 Number of proteins with coverage > 90%: 3 P61604 (HSPE1) - 10 kDa heat shock protein, mitochondrial - Length: 102 aa P08727 (KRT19) - Keratin, type I cytoskeletal 19 - Length: 400 aa O76070 (SNCG ) - Gamma-synuclein - Length: 127 aa
This figure provides a scatter plot and a histogram of protein coverage statistics. The scatter plot shows log10-transformed protein lengths against log10-transformed peptide counts, with point size and color indicating sequence coverage percentage. The histogram displays the distribution of sequence coverage across all proteins; it is almost always right-skewed, indicating that many proteins have low coverage while a few have high coverage. Unless the dataset is extremely well covered, this pattern is expected. The numbers of proteins exceeding 75% and 90% coverage are printed above; proteins above 90% are typically very few, so they are listed individually for reference.
# Check if info_data has the unique_id column
info_data = info_data.set_index(['Protein', 'unique_id'], drop=False)
data_wide.index = info_data.index
# Ensure the columns are numeric
data_wide = data_wide.apply(pd.to_numeric, errors='coerce')
04.3 Filter Proteins by Peptide Count¶
Minimum peptide requirement: ProteoForge requires at least 4 peptides per protein for reliable proteoform detection. Proteins with fewer peptides are excluded from downstream analysis. While this might look too stringent, having multiple peptides per proteoform is crucial for accurate outlier detection and clustering. Going below this threshold can compromise the reliability of proteoform assignments.
tmp = info_data.reset_index(drop=True).copy()
prot_size = tmp.groupby('Protein')['Peptide'].nunique().sort_values(ascending=False)
# Print summary statistics
total_proteins = len(prot_size)
num_proteins_ge4 = (prot_size >= 4).sum()
num_proteins_lt4 = (prot_size < 4).sum()
print(f"Total proteins detected: {total_proteins}")
print(f"Proteins with ≥4 peptides: {num_proteins_ge4}")
print(f"Proteins with <4 peptides: {num_proteins_lt4}")
print("\nMinimum of 4 peptides per protein is required for ProteoForge, only proteins with ≥4 peptides will be used for downstream analysis.")
print(f"This would retain {num_proteins_ge4} proteins and exclude {num_proteins_lt4} proteins.")
# Enhanced visualization of protein size distribution
fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(np.log10(prot_size), bins=30, kde=True, color="#139593", ax=ax, edgecolor='black', alpha=0.8)
# Add vertical line at log10(4)
ax.axvline(np.log10(4), color='red', linestyle='--', linewidth=2, label='Peptide Size = 4')
# Add text box summarizing the filtering effect
ax.text(
0.75, ax.get_ylim()[1]*0.8,
f"Total proteins: {total_proteins}\nRetained (≥4 peptides): {num_proteins_ge4}\nExcluded (<4 peptides): {num_proteins_lt4}",
fontsize=11, color='navy', bbox=dict(facecolor='white', alpha=0.7, boxstyle='round,pad=0.5')
)
ax.set_ylabel('Frequency', fontsize=13)
ax.set_xlabel('Log10(Number of Peptides per Protein)', fontsize=13)
ax.set_title('Distribution of Protein Sizes (Peptides per Protein)', fontsize=15, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(which='both', linestyle='--', linewidth=0.5, color='lightgray', alpha=0.75)
plt.tight_layout()
plots.finalize_plot(
fig, show=True, save=save_to_folder,
filename='proteinSize_distribution',
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi
)
print()
print("Updating the main data by keeping only proteins with at least 4 peptides...")
print(f"- Data shape before filtering: {data_wide.shape}")
print(f"- Proteins before filtering: {data_wide.index.get_level_values('Protein').nunique()}")
print(f"- Peptides before filtering: {data_wide.index.get_level_values('unique_id').nunique()}")
data_wide = data_wide.loc[data_wide.index.get_level_values('Protein').isin(prot_size[prot_size >= 4].index)]
print("Filtering complete.")
print(f"- Data shape after filtering: {data_wide.shape}")
print(f"- Proteins after filtering: {data_wide.index.get_level_values('Protein').nunique()}")
print(f"- Peptides after filtering: {data_wide.index.get_level_values('unique_id').nunique()}")
Total proteins detected: 11035 Proteins with ≥4 peptides: 7161 Proteins with <4 peptides: 3874 Minimum of 4 peptides per protein is required for ProteoForge, only proteins with ≥4 peptides will be used for downstream analysis. This would retain 7161 proteins and exclude 3874 proteins.
Updating the main data by keeping only proteins with at least 4 peptides... - Data shape before filtering: (102179, 16) - Proteins before filtering: 11035 - Peptides before filtering: 102179 Filtering complete. - Data shape after filtering: (95369, 16) - Proteins after filtering: 7161 - Peptides after filtering: 95369
The protein size distribution shows the number of peptides quantified per protein. The vertical red dashed line marks the ≥4 peptide threshold required for ProteoForge analysis. Proteins below this threshold lack enough peptides for reliable analysis and have been filtered out, which amounts to approximately 35% of the proteins in this dataset.
Note: this filtering may again look too stringent and aggressive. In some special cases one might consider lowering the threshold to 3 peptides, but it is not recommended.
04.4 Outlier Sample Detection¶
This is a standard way of detecting whether any sample behaves abnormally compared to the rest of the dataset, specifically within its biological group. Such outlier samples can skew the analysis, lower confidence by increasing variability, and lead to misleading conclusions, so identifying and removing them is crucial for maintaining data integrity. A sample can behave abnormally for many reasons, including technical issues during sample preparation, instrument errors during data acquisition, or biological variability. Identifying these outliers helps ensure that the analysis focuses on reliable and representative data.
The method we use examines correlation, the number of quantified values, and the coefficient of variation within biological groups. A sample is flagged when, for any of these metrics, it falls outside the allowed IQR range or shows low correlation with the other samples in its group. If a sample is flagged multiple times (here, at least 3 of 7 checks), it is considered an outlier and removed from the dataset.
Outlier detection identifies samples with anomalous intensity distributions using pairwise correlation, quantification ratios, and coefficient of variation metrics. Samples flagged as outliers across multiple metrics are removed to ensure data quality.
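A toy illustration of the multi-metric IQR flagging idea, using hypothetical per-sample metric values and only three metric vectors instead of the seven checks used here (the internals of utils.identify_outlier_samples differ):

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; returns booleans."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]

# Hypothetical group of 4 samples scored on three metrics, where higher
# is better (correlation, quantified fraction, and 1 - CV)
flags_per_metric = [
    iqr_outliers([0.95, 0.96, 0.94, 0.60]),  # within-group correlation
    iqr_outliers([0.90, 0.92, 0.91, 0.55]),  # quantified fraction
    iqr_outliers([0.88, 0.90, 0.89, 0.50]),  # 1 - CV
]

# Count flags per sample; require multiple independent flags before
# calling a sample an outlier
total_flags = [sum(col) for col in zip(*flags_per_metric)]
outliers = [i for i, n in enumerate(total_flags) if n >= 3]
```

Requiring agreement across metrics is what keeps a single noisy metric from discarding an otherwise healthy sample.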
metric_weights = { # Weights for the metrics used in pairwise combined score calculation
'correlation': 1.0,
'quantification': 1.0,
'cv': 1.0
}
min_outlier_occurrence = 3 # minimum number of outlier flags before a sample is treated as an outlier
conditions_to_check = list(condition_to_samples.keys()) # List of conditions to check for outlier detection
print("\nCalculating Pairwise Combined Scores...")
print(" - Using metric weights:", metric_weights)
print(" - Using Spearman correlation, quantification as ratio, and CV as 1 - CV")
all_outlier_samples, combined_group_reports = utils.identify_outlier_samples(
data=data_wide,
sample_groups=condition_to_samples,
analysis_func=utils.run_metric_combination_analysis,
min_outlier_occurrence=min_outlier_occurrence,
verbose=False
)
print(f"\nComplete list of unique outlier samples to remove (after min_outlier_occurrence >= {min_outlier_occurrence})\n - {all_outlier_samples}")
print()
plot_data = data_wide.reset_index().melt(
id_vars=['Protein', 'unique_id'],
var_name='Sample',
value_name='Intensity'
).dropna().assign(log2_Intensity=lambda x: np.log2(x['Intensity']))
fig, ax = plt.subplots(figsize=(12, 4))
sns.boxplot(
x='Sample', y='log2_Intensity', hue='Sample', data=plot_data, ax=ax,
palette=samples_to_colors,
fliersize=0.5,
linewidth=0.5,
notch=False,
showmeans=True,
meanprops={"marker": "o", "markerfacecolor": "white", "markeredgecolor": "black"},
medianprops={"color": "white", "linewidth": 2.5},
whiskerprops={"color": "black", "linewidth": 1},
capprops={"color": "black", "linewidth": 0},
legend=False
)
y_min = 0
y_max = plot_data['log2_Intensity'].max() + 1
ax.set_ylim(y_min, y_max)
ax.tick_params(axis='x', labelrotation=90)
ax.set_title('Boxplot of Log2 Intensities Across Samples', fontsize=16, fontweight='bold')
ax.set_xlabel('Sample', fontsize=14)
ax.set_ylabel('Log2(Intensity)', fontsize=14)
ax.grid(which='both', linestyle='--', linewidth=0.5, color='lightgray', alpha=0.7)
import matplotlib.patches as mpatches
# Highlight background for outlier samples
for i, label in enumerate(ax.get_xticklabels()):
sample_label = label.get_text()
if sample_label in all_outlier_samples:
# Add a rectangle spanning the y-range at the x position
ax.add_patch(
mpatches.Rectangle(
(i - 0.5, y_min), # x position, y position
1, # width (one boxplot)
y_max - y_min, # height
color='#ffcccc', # light red
zorder=0, # behind the boxplot
alpha=0.5
)
)
# Add legend for outlier highlight
outlier_patch = mpatches.Patch(color='#ffcccc', label='Identified Outlier Sample', alpha=0.5)
ax.legend(handles=[outlier_patch], loc='upper right', fontsize=12)
plt.tight_layout()
plots.finalize_plot(
fig, show=True, save=save_to_folder,
filename='boxplot_log2Intensity_with_outliers',
filepath=figure_path,
formats=figure_formats,
transparent=transparent_bg,
dpi=figure_dpi
)
# Remove the identified outlier samples from the main data
print("\nRemoving identified outlier samples from the main data...")
print(f"- Samples to remove: {all_outlier_samples}")
print(f"- Data shape before removal: {data_wide.shape}")
cleaned_data = data_wide.drop(columns=all_outlier_samples)
print(f"- Data shape after removal: {cleaned_data.shape}")
print("Outlier removal complete.")
Calculating Pairwise Combined Scores...
- Using metric weights: {'correlation': 1.0, 'quantification': 1.0, 'cv': 1.0}
- Using Spearman correlation, quantification as ratio, and CV as 1 - CV
Complete list of unique outlier samples to remove (after min_outlier_occurrence >= 3)
- ['1%_72_1', 'C_72_4']
Removing identified outlier samples from the main data... - Samples to remove: ['1%_72_1', 'C_72_4'] - Data shape before removal: (95369, 16) - Data shape after removal: (95369, 14) Outlier removal complete.
meta = meta[~meta['Colname'].isin(all_outlier_samples)].reset_index(drop=True)
# Build a dictionary to match condition_colors keys: 'Condition - Duration Hr'
condition_to_samples = {}
for condition in meta['Condition'].unique():
for duration in sorted(meta.loc[meta['Condition'] == condition, 'Duration'].unique()):
key = f"{condition} - {duration} Hr"
samples = meta.loc[(meta['Condition'] == condition) & (meta['Duration'] == duration), 'Colname'].tolist()
condition_to_samples[key] = samples
# samples_to_condition mapping
samples_to_condition = {sample: condition for condition, samples in condition_to_samples.items() for sample in samples}
# samples_to_colors mapping
samples_to_colors = {sample: condition_colors[condition] for sample, condition in samples_to_condition.items()}
Outlier detection identified 2 samples with anomalous intensity distributions based on pairwise correlation, quantification ratios, and CV metrics. Samples flagged as outliers (≥3 metric violations) are highlighted in the boxplot with light red background and removed from further analysis. The metadata and sample mappings are updated accordingly.
Note 1: This step is highly dataset dependent. Some datasets contain no outliers at all, while others contain several. The key is a robust method that identifies outliers based on multiple metrics rather than a single criterion.
Note 2: I wasn't sure whether to perform this step, since the N is already low, but I wanted to keep the data consistent with my other analyses.
04.5 Filter by Minimum Quantification¶
Removes peptides with insufficient quantification across conditions. Entries must have at least 2 valid measurements in at least one condition to be retained, ensuring that only peptides with adequate data for statistical comparisons are included in downstream analyses. This removes peptides that are mostly missing across all samples or conditions: such peptides provide no reliable information for proteoform detection, and imputing them later with low values would add nothing to the analysis.
min_limit = 2 # Minimum number of peptides or intensity to consider a valid quantification
metric_type = 'Count' # 'Count' for number of peptides, 'Intensity' for summed intensity
filter_type = 'or' # 'or' for any condition, 'and' for all conditions
# --- Highlighted Print Statements ---
print("="*80)
print(f"🔬 CONFIGURATION: Filtering by '{metric_type}' with a threshold of ≥ {min_limit}.")
print(f"Logic: Keep entries if {'ANY' if filter_type == 'or' else 'ALL'} condition(s) meet the threshold.")
print("="*80)
# --- Calculating Statistics ---
print("\n📊 Calculating quantification statistics...")
all_stats = {}
for cond, cols in condition_to_samples.items():
    quantified_count = cleaned_data[cols].notna().sum(axis=1)
    total_cols_in_condition = len(cols)
    quantified_pctl = (quantified_count / total_cols_in_condition) * 100
    # Add the results to our dictionary
    all_stats[f'{cond} Quantified Count'] = quantified_count
    all_stats[f'{cond} Quantified Percentage'] = quantified_pctl.round(2)  # Rounding for clarity
stats_df = pd.DataFrame(all_stats)
print(f"Total entries identified: {len(stats_df):,}")
# --- Filtering Logic and Results ---
print("\n" + "-"*80)
print("🔍 APPLYING FILTER...")
if filter_type == 'or':
    # Keep rows where at least one condition meets the limit
    idx_to_keep = (stats_df.loc[:, stats_df.columns.str.contains(metric_type)] >= min_limit).any(axis=1)
elif filter_type == 'and':
    # Keep rows where all conditions meet the limit
    idx_to_keep = (stats_df.loc[:, stats_df.columns.str.contains(metric_type)] >= min_limit).all(axis=1)
else:
    raise ValueError("Invalid filter_type. Use 'or' or 'and'.")
# --- Summarize Results ---
total_count = len(stats_df)
kept_count = idx_to_keep.sum()
percentage_kept = (kept_count / total_count) * 100
print("\n" + "="*80)
print("✅ FILTERING COMPLETE: SUMMARY")
print("="*80)
print(f"{'Initial Entries:':<20} {total_count:>10,}")
print(f"{'Entries Kept:':<20} {kept_count:>10,}")
print(f"{'Entries Discarded:':<20} {total_count - kept_count:>10,}")
print(f"{'Percentage Kept:':<20} {percentage_kept:>9.2f}%")
print("="*80)
================================================================================
🔬 CONFIGURATION: Filtering by 'Count' with a threshold of ≥ 2.
Logic: Keep entries if ANY condition(s) meet the threshold.
================================================================================

📊 Calculating quantification statistics...
Total entries identified: 95,369

--------------------------------------------------------------------------------
🔍 APPLYING FILTER...

================================================================================
✅ FILTERING COMPLETE: SUMMARY
================================================================================
Initial Entries:          95,369
Entries Kept:             93,309
Entries Discarded:         2,060
Percentage Kept:           97.84%
================================================================================
The quantification filter retains peptides with at least 2 valid intensity measurements in at least one experimental condition. This kept close to 98% of the data, keeping only entries with sufficient data points for reliable statistical analysis while maintaining coverage across the dataset.
04.6 Median Centering Normalization¶
Median centering corrects for systematic intensity differences between samples by aligning sample medians while preserving the original intensity magnitude scale.
centered_data = normalize.by_median_centering(
df=cleaned_data,
rescale_to_original_magnitude=True,
condition_map=None,  # Pass a dictionary to apply condition-specific centering
verbose=verbose
)
# plots.normalized_density_comparison(
# before_df=data_wide_cleaned,
# after_df=centered_data,
# condition_map=condition_to_samples,
# color_palette=condition_colors,
# log_transform=True,
# title="\nDensity of Samples",
# show=True,
# save=save_to_folder,
# filename="DensityPlot_Normalization_Comparison",
# fileformats=figure_formats,
# filepath=figure_path,
# transparent=transparent_bg,
# dpi=figure_dpi,
# )
Applying Median Centering...
- Applying global median centering...
- Rescaling by a global factor of: 16384.00
- Data after median centering has shape: (95369, 14)
- Missing values (Before/After): (145644/145644)
Median centering is a standard, non-invasive normalization technique for correcting systematic intensity differences between samples. It aligns sample medians while preserving the original intensity magnitude scale, ensuring comparability across samples without distorting relative abundance relationships. Centering can also be applied within biological groups, but since we have only two conditions with low N, we apply it across all samples.
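As a minimal sketch of what global median centering with magnitude rescaling does (the actual `normalize.by_median_centering` implementation may differ in details):

```python
import numpy as np
import pandas as pd

def median_center(df: pd.DataFrame, rescale: bool = True) -> pd.DataFrame:
    """Divide each sample (column) by its own median, then optionally
    rescale by the global median so intensities keep their original magnitude."""
    centered = df.div(df.median(axis=0, skipna=True), axis=1)
    if rescale:
        centered = centered * np.nanmedian(df.values)
    return centered

# Toy example: sample s2 carries a systematic 2x intensity offset vs s1.
toy = pd.DataFrame({"s1": [10.0, 20.0, 30.0], "s2": [20.0, 40.0, 60.0]})
out = median_center(toy)
# After centering, both samples share the same median intensity.
print(out["s1"].median() == out["s2"].median())  # True
```

Because the division happens per column, the relative abundances within each sample are untouched; only the between-sample offset is removed.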
04.7 Missing Value Imputation¶
Multi-strategy imputation addresses different missingness patterns:
- Amputate: Set sparsely quantified features (only 1 of 4 replicates observed) to completely missing
- Rationale: Single missing values in a condition with 4 replicates are likely random dropouts (MCAR).
- Fill dense: Mean-fill single missing values per condition
- Rationale: When only one value is missing in a condition, it is reasonable to impute it with the mean of the other replicates, assuming the missingness is random.
- Downshift: Low-value imputation for completely missing features (MNAR)
- Rationale: Features missing in all replicates of a condition are likely below detection limits (MNAR) and are imputed with low values to reflect this.
- k-NN: Neighbor-based imputation for remaining missing values
- Rationale: For remaining missing values that do not fit the above patterns, k-NN imputation leverages similarity between peptides to estimate missing intensities based on neighboring peptide profiles.
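The downshift strategy above can be sketched in a few lines; this is a simplified stand-in for the `impute` module, and the Gaussian imputation distribution with a fixed width is my assumption, not necessarily what the library uses:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def downshift_impute(df, shift_magnitude=5.0, low_percentile=0.05, width=0.3):
    """Fill missing values with draws from a distribution centred
    `shift_magnitude` log2 units below the `low_percentile` quantile of
    the observed log2 intensities (MNAR / below-detection-limit assumption)."""
    log2 = np.log2(df.to_numpy(dtype=float))
    observed = log2[~np.isnan(log2)]
    center = np.quantile(observed, low_percentile) - shift_magnitude
    filled = log2.copy()
    miss = np.isnan(filled)
    filled[miss] = rng.normal(loc=center, scale=width, size=miss.sum())
    return pd.DataFrame(2.0 ** filled, index=df.index, columns=df.columns)

toy = pd.DataFrame({"s1": [1000.0, 2000.0, np.nan], "s2": [1500.0, np.nan, 800.0]})
out = downshift_impute(toy)
# Imputed values land well below the lowest observed intensity (800).
```

Shifting in log2 space is what makes the `shift_magnitude` parameter intuitive: each unit halves the imputation centre relative to the low-percentile anchor.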
Examining the Downshifted Low-Value Imputation Profiles¶
Before choosing parameters for downshifted low-value imputation, it is important to understand the low-percentile intensity distributions of the dataset. The two versions of the figure below examine the 0.10 and 0.05 percentile intensity distributions across all samples combined, paired with downshift magnitudes of 1.5 and 5, respectively. This visualizes how much the downshifted low values overlap with the actual low-intensity values in the dataset, so an informed decision can be made on the downshift magnitude to use for imputation.
print("\nChecking the downshifted low-value imputation parameter's impact...")
print(" The shift magnitude is applied in log2 space; if the data is already log2-transformed, is_log2 must be set to True.")
shift_mag = 1.5
low_percentile=0.10
print(f" - Using the default values shift magnitude: {shift_mag}, low percentile: {low_percentile}")
plots.downshift_effect(
df=centered_data,
cond_dict=condition_to_samples,
shift_magnitude=shift_mag,
low_percentile=low_percentile,
is_log2=False,
show=True,
save=save_to_folder,
filename="Data_DownshiftedImputation_Effect_default",
fileformats=figure_formats,
filepath=figure_path,
transparent=transparent_bg,
dpi=figure_dpi,
)
shift_mag = 5
low_percentile = 0.05
print(f"\n - Adjusting the shift magnitude to {shift_mag} and low percentile to {low_percentile} for downshifted imputation.")
plots.downshift_effect(
df=centered_data,
cond_dict=condition_to_samples,
shift_magnitude=shift_mag,
low_percentile=low_percentile,
is_log2=False,
show=True,
save=save_to_folder,
filename="Data_DownshiftedImputation_Effect_adjusted",
fileformats=figure_formats,
filepath=figure_path,
transparent=transparent_bg,
dpi=figure_dpi,
)
print(" - Accepting the adjusted parameters for downshifted imputation.")
Checking the downshifted low-value imputation parameter's impact...
 The shift magnitude is applied in log2 space; if the data is already log2-transformed, is_log2 must be set to True.
 - Using the default values shift magnitude: 1.5, low percentile: 0.1
- Adjusting the shift magnitude to 5 and low percentile to 0.05 for downshifted imputation.
- Accepting the adjusted parameters for downshifted imputation.
The downshift parameter visualization compares the default (shift=1.5, percentile=0.10) and adjusted (shift=5, percentile=0.05) settings. The adjusted parameters provide a clearer separation between imputed low values and the real data distribution, reducing the risk that imputed values overlap with genuine low-intensity measurements, which could otherwise produce false-positive abundance changes in downstream analyses.
Applying Imputation Strategies¶
The multi-strategy imputation is provided by the impute module and requires a list of dictionaries defining the strategies to apply in order. Here the strategies described above are implemented sequentially to address the different missingness patterns in the data.
imputation_pipeline_scheme = [
    { # 1) Amputate sparsely quantified features so they become fully missing for low-value imputation
        'method': 'amputate',
        'params': {'min_quantified': 2}  # fewer than 2 quantified (i.e. 1 of 4 replicates) -> set the whole condition to missing
    },
    # I generally don't use this step, but it can be useful for some datasets
    { # 2) Fill sparse missingness with the feature's mean or median within each condition
        'method': 'fill_dense',
        'params': {'max_missing': 1, 'strategy': 'mean'}
    },
    # This is the best option for completely or mostly missing features
    { # 3) Add downshifted low values to completely missing (1.0) or mostly missing (< 1.0) features
        'method': 'downshift',
        'params': {'missingness_threshold': 1.0, 'shift_magnitude': 5, 'low_percentile': 0.05}
    },
    { # 4) Use k-NN imputation to fill the remaining missing values
        'method': 'knn',
        'params': {'n_neighbors': 5}
    }
]
imputed_data = impute.run_imputation_pipeline(
data=centered_data,
cond_dict=condition_to_samples,
scheme=imputation_pipeline_scheme,
is_log2=False,
return_log2=False,
verbose=verbose
)
# print("\nVisualizating the distribution of imputed and real values:")
# # if number of columns higher than 20, then use
# if len(imputed_data.columns) > 20:
# print(" - Visualize per condition, as there are too many samples to visualize per sample")
# plots.imputation_distribution_per_condition(
# original_data=centered_data,
# imputed_data=imputed_data,
# title="\nDistribution of Imputed and Real Values per Condition",
# is_log2=False,
# cond_dict=condition_to_samples,
# show=True,
# save=save_to_folder,
# filename="Imputation_Distribution_PerCondition",
# fileformats=figure_formats,
# filepath=figure_path,
# transparent=transparent_bg,
# dpi=figure_dpi,
# )
# else:
# print(" - Visualize per sample, as there are not too many samples to visualize per sample")
# plots.imputation_distribution_per_sample(
# original_data=centered_data,
# imputed_data=imputed_data,
# title="\nDistribution of Imputed and Real Values per Sample",
# is_log2=False,
# show=True,
# save=save_to_folder,
# filename="Imputation_Distribution_PerSample",
# fileformats=figure_formats,
# filepath=figure_path,
# transparent=transparent_bg,
# dpi=figure_dpi,
# )
====== Starting Imputation Pipeline ======
Initial missing values: 145644
--- Running Step: Sparse Feature Amputation ---
Parameters: min_quantified=2
Condition 'True Hypoxia (1% O2) - 48 Hr': Amputating 3017 features with < 2 quantified values.
Condition 'True Hypoxia (1% O2) - 72 Hr': Amputating 5478 features with < 2 quantified values.
Condition 'Normoxia (21% O2) - 48 Hr': Amputating 2870 features with < 2 quantified values.
Condition 'Normoxia (21% O2) - 72 Hr': Amputating 6090 features with < 2 quantified values.
Summary: Created 17455 new missing values by amputation.
Total missing values after step 1 ('amputate'): 163099
--- Running Step: Dense Feature Filling ---
Parameters: max_missing=1, strategy='mean'
Condition 'True Hypoxia (1% O2) - 48 Hr': Filling 9961 features with 1 to 1 missing values.
Condition 'True Hypoxia (1% O2) - 72 Hr': Filling 9163 features with 1 to 1 missing values.
Condition 'Normoxia (21% O2) - 48 Hr': Filling 9457 features with 1 to 1 missing values.
Condition 'Normoxia (21% O2) - 72 Hr': Filling 10031 features with 1 to 1 missing values.
Summary: Filled 38612 missing values.
Total missing values after step 2 ('fill_dense'): 124487
--- Running Step: Downshifted Imputation ---
Parameters: missingness_threshold=1.0, shift_magnitude=5, low_percentile=0.05
Condition 'True Hypoxia (1% O2) - 48 Hr': Imputing 5109 features with >= 4 missing values.
Condition 'True Hypoxia (1% O2) - 72 Hr': Imputing 9166 features with >= 3 missing values.
Condition 'Normoxia (21% O2) - 48 Hr': Imputing 4877 features with >= 4 missing values.
Condition 'Normoxia (21% O2) - 72 Hr': Imputing 9349 features with >= 3 missing values.
Summary: Imputed 95489 missing values.
Total missing values after step 3 ('downshift'): 28998
--- Running Step: k-NN Imputation ---
Parameters: n_neighbors=5
Condition 'True Hypoxia (1% O2) - 48 Hr': Applying k-NN to 14412 remaining missing values.
Condition 'Normoxia (21% O2) - 48 Hr': Applying k-NN to 14586 remaining missing values.
Summary: Imputed 28998 missing values.
Total missing values after step 4 ('knn'): 0
====== Imputation Pipeline Finished ======
The multi-strategy imputation pipeline addresses different missingness patterns sequentially: (1) amputate sparse values, (2) mean-fill single missing values per condition, (3) downshift completely missing values, and (4) k-NN imputation for remaining gaps. This approach treats Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR) patterns appropriately.
You can see the imputation results summary above, which shows how many values were imputed by each strategy. The majority of imputations were done using the downshift method for completely missing features, followed by k-NN for remaining gaps. This indicates that many peptides had missing values likely due to being below detection limits in certain conditions.
04.8 Coefficient of Variation Analysis¶
Coefficient of Variation (CV) measures within-condition reproducibility. Lower CV values indicate more consistent measurements across biological replicates, with CV <25% generally considered acceptable for quantitative proteomics.
print("\nCalculating Coefficient of Variation (CV) per feature, grouped by condition...")
cv_data = pd.DataFrame(index = imputed_data.index)
# Calculate the coefficient of variation per biological sample
for k, v in condition_to_samples.items():
    cv_data[k] = utils.cv_numpy(
        imputed_data[v].values,
        axis=1,
        ignore_nan=True,
        format="percent"
    )
plot_data = utils.create_cv_group_plot_data(
cv_data=cv_data,
cv_group_palettes=cv_group_palettes
).dropna(subset=['CV'])
plots.cv_comparison(
plot_data=plot_data,
cv_group_palettes=cv_group_palettes,
condition_colors=condition_colors,
main_title="\nCoefficient of Variation (CV) Comparison",
show=True,
save=save_to_folder,
filename="CV_Comparison",
fileformats=figure_formats,
filepath=figure_path,
transparent=transparent_bg,
dpi=figure_dpi,
)
Calculating Coefficient of Variation (CV) per feature, grouped by condition...
The CV comparison visualizes measurement reproducibility across experimental conditions. Although the summary table dominates the figure over the boxplots, the mean and median CV values are generally below 25%, indicating good reproducibility. Some high-CV outliers exist, which is common in proteomics data. Additionally, the proportion of features with CV below 10% is generally 15-25%, a good sign of high-quality data.
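For reference, a plain numpy equivalent of the CV computation (assuming `utils.cv_numpy` uses the sample standard deviation, ddof=1):

```python
import numpy as np

def cv_percent(values: np.ndarray, axis: int = 1) -> np.ndarray:
    """Coefficient of variation (%) across replicates, ignoring NaNs."""
    return np.nanstd(values, axis=axis, ddof=1) / np.nanmean(values, axis=axis) * 100

# One feature measured in four biological replicates.
reps = np.array([[100.0, 110.0, 90.0, 100.0]])
cv = cv_percent(reps)
print(round(float(cv[0]), 2))  # 8.16 -- well below the 25% acceptability threshold
```

Using ddof=1 matters at N=4: the population formula (ddof=0) would understate replicate variability.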
05. Complete Final Checks and Save the Data¶
05.1 Final Checks on the Data¶
Before finalizing the cleaned data tables for downstream analysis, we perform final checks to ensure data integrity and consistency: verifying that all filtering and normalization steps have been applied correctly, checking for any remaining missing values, and confirming that the metadata accurately reflects the cleaned dataset. Once these checks pass, the cleaned tables are saved to the specified output directory for use in subsequent proteoform analysis steps.
print("Making sure the info_data is aligned with the imputed data (latest version)...")
info_data = info_data.loc[imputed_data.index].reset_index(drop=True)
print()
print("Establishing protein data; it will be used in later steps.")
protein_data = info_data.groupby('Protein').agg(
Gene=('Gene', 'first'),
PeptideCount=('Peptide', 'count'),
Coverage=('Cov%', 'mean'),
Length=('Length', 'first'),
Weight_kDa=('Weight(kDa)', 'first'),
Description=('Description', 'first'),
Sequence=('Sequence', 'first')
).reset_index()
print(f"Protein data shape: {protein_data.shape}")
protein_data.head()
Making sure the info_data is aligned with the imputed data (latest version)...

Establishing protein data; it will be used in later steps.
Protein data shape: (7161, 8)
| Protein | Gene | PeptideCount | Coverage | Length | Weight_kDa | Description | Sequence | |
|---|---|---|---|---|---|---|---|---|
| 0 | A0A024RBG1 | NUDT4B | 5 | 30.3867 | 181 | 20.4213 | Diphosphoinositol polyphosphate phosphohydrola... | MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQW... |
| 1 | A0A0B4J2D5 | GATD3B | 7 | 35.4478 | 268 | 28.1247 | Putative glutamine amidotransferase-like class... | MAAVRALVASRLAAASAFTSLSPGGRTPSQRAALHLSVPRPAARVA... |
| 2 | A0A1B0GTU1 | ZC3H11B | 5 | 8.1988 | 805 | 88.8824 | Zinc finger CCCH domain-containing protein 11B | MPNQGEDCYFFFYSTCTKGDSCPFRHCEAALGNETVCTLWQEGRCF... |
| 3 | A0A1W2PQL4 | ZNF722 | 5 | 11.7188 | 384 | 44.5719 | Zinc finger protein 722 | MAERPGSPGSREMRLLTFRDIAIEFSLEEWQCLDCAQQNLYRDVML... |
| 4 | A0A2R8Y4L2 | HNRNPA1L3 | 12 | 44.7273 | 275 | 29.1558 | Heterogeneous nuclear ribonucleoprotein A1-like 3 | MRDPNTKRSRGFGFVTYATVEEVDAAMNARPHKVDGRVVEPKRAVS... |
The alignment of the information table with the quantitative data is the most crucial step here. A protein-level table is then generated (above), where info such as PeptideCount and Coverage is added to the FASTA-like table for ease of access.
05.2 Subset of UniProt Annotations¶
The UniProt annotation table is large, and we don't need all of it, only the 7,161 proteins present in our final dataset. Therefore, we subset the UniProt annotations to entries corresponding to those proteins. This reduces file size and improves efficiency for downstream analyses that use these annotations.
print("Filtering UniProt data to match the identified proteins...")
uniprot_data = uniprot_data[uniprot_data['Protein'].isin(protein_data['Protein'].tolist())]
print(f"Filtered UniProt data shape: {uniprot_data.shape}")
print("Renaming UniProt columns for clarity...")
uniprot_data = uniprot_data.reset_index(drop=True)
print("UniProt data columns:", uniprot_data.columns.tolist())
uniprot_data.head()
Filtering UniProt data to match the identified proteins...
Filtered UniProt data shape: (774552, 10)
Renaming UniProt columns for clarity...
UniProt data columns: ['isoform', 'Protein', 'feature', 'start', 'end', 'group', 'agent', 'note', 'description', 'score']
| isoform | Protein | feature | start | end | group | agent | note | description | score | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A0A024RBG1 | CHAIN | 1 | 181 | Diphosphoinositol polyphosphate phosphohydrola... | 3 | ||||
| 1 | A0A024RBG1 | CROSSLNK | 5 | 5 | Ubiquitination/SUMOylation | Ubiquitination/SUMOylation at K5 | Source: psp; Type: UBIQUITINATION | 1 | ||
| 2 | A0A024RBG1 | BINDING | 10 | 10 | 3 | |||||
| 3 | A0A024RBG1 | BINDING | 18 | 20 | 3 | |||||
| 4 | A0A024RBG1 | DOMAIN | 18 | 145 | Nudix hydrolase | 3 |
05.3 Save All the Cleaned and Processed Tables¶
Finally, all cleaned and processed data tables are saved to the designated output directory. They are mainly saved in Feather format for efficient I/O, which is well suited to the large tabular datasets common in proteomics. Feather files can't be opened and viewed directly like TSV or CSV files, but they are much faster to read and write, which is crucial when dealing with large datasets.
print("\n📁 Saving cleaned table...")
cleaned_data.to_feather(f'{output_path}cleaned_data.feather')
print(f"Cleaned data saved to: {output_path}cleaned_data.feather")
print("\n📁 Saving normalized data...")
centered_data.to_feather(f'{output_path}centered_data.feather')
print(f"Centered data saved to: {output_path}centered_data.feather")
print("\n📁 Saving imputed data...")
imputed_data.to_feather(f'{output_path}imputed_data.feather')
print(f"Imputed data saved to: {output_path}imputed_data.feather")
print("\n📁 Saving metadata...")
meta.to_csv(f'{output_path}metadata.csv', index=False)
print(f"Metadata saved to: {output_path}metadata.csv")
print("\n📁 Saving protein data...")
protein_data.to_feather(f'{output_path}protein_data.feather')
print(f"Protein data saved to: {output_path}protein_data.feather")
print("\n📁 Saving UniProt data...")
uniprot_data.to_feather(f'{output_path}uniprot_data.feather')
print(f"UniProt data saved to: {output_path}uniprot_data.feather")
print("\n📁 Saving info data...")
info_data.to_feather(f'{output_path}info_data.feather')
print(f"Info data saved to: {output_path}info_data.feather")
📁 Saving cleaned table...
Cleaned data saved to: ./data/cleaned/hypoxia/cleaned_data.feather

📁 Saving normalized data...
Centered data saved to: ./data/cleaned/hypoxia/centered_data.feather

📁 Saving imputed data...
Imputed data saved to: ./data/cleaned/hypoxia/imputed_data.feather

📁 Saving metadata...
Metadata saved to: ./data/cleaned/hypoxia/metadata.csv

📁 Saving protein data...
Protein data saved to: ./data/cleaned/hypoxia/protein_data.feather

📁 Saving UniProt data...
UniProt data saved to: ./data/cleaned/hypoxia/uniprot_data.feather

📁 Saving info data...
Info data saved to: ./data/cleaned/hypoxia/info_data.feather
Summary¶
This notebook prepared peptide-level proteomics data from H358 NSCLC cells under hypoxia treatment for proteoform analysis with ProteoForge.
Output Files¶
| File | Description |
|---|---|
| `cleaned_data.feather` | Data after outlier removal and filtering |
| `centered_data.feather` | Median-centered normalized intensities |
| `imputed_data.feather` | Complete dataset after multi-strategy imputation |
| `metadata.csv` | Sample metadata with condition and duration |
| `protein_data.feather` | Protein-level summary (gene, coverage, length) |
| `uniprot_data.feather` | Filtered UniProt annotations |
| `info_data.feather` | Peptide-to-protein mapping with sequences |
Key Processing Steps¶
- Data Loading: Imported DIA-NN output, UniProt sequences, and annotations
- Sequence Mapping: Mapped peptide positions and calculated sequence coverage
- Quality Filtering: Removed proteins with <4 peptides, identified outlier samples
- Normalization: Applied median centering with original magnitude preservation
- Imputation: Multi-strategy approach (amputate → fill dense → downshift → k-NN)
- Quality Assessment: Evaluated CV distribution across experimental conditions
The prepared data is now ready for ProteoForge proteoform detection analysis.
print("Notebook Execution Time:", utils.prettyTimer(utils.getTime() - startTime))
Notebook Execution Time: 00h:02m:38s